# Model Building - LSTM and BiLSTM

Here, the LSTM and the Bidirectional LSTM Recurrent Neural Networks will be used for the training process. Both of them uses the RNN structure in which the training is done in a series of steps where the output of a step will be used as an input for the next step along with the respective input of that step. This helps in maintaining a relationship between the outputs at different steps. Normally the RNNs are used for time series forecasting. But this can also be used for the text analysis. The LSTM and BiLSTM are are certain variations of the regular RNN.

## Import the libraries

In [4]:
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
import codecs
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import re
import sys
import warnings
import pickle
from bs4 import BeautifulSoup
warnings.filterwarnings("ignore")

In [5]:
data = pd.read_csv('cleaned_text2.csv')
data.head()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned_text
0,0,0,0,0,0,0,explanation edits made username hardcore metal...
1,0,0,0,0,0,0,daww matches background colour seemingly stuck...
2,0,0,0,0,0,0,hey man really trying edit war guy constantly ...
3,0,0,0,0,0,0,make real suggestions improvement wondered sec...
4,0,0,0,0,0,0,sir hero chance remember page


In [6]:
data.shape

(159571, 7)

In [46]:
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   toxic          159571 non-null  int64 
 1   severe_toxic   159571 non-null  int64 
 2   obscene        159571 non-null  int64 
 3   threat         159571 non-null  int64 
 4   insult         159571 non-null  int64 
 5   identity_hate  159571 non-null  int64 
 6   cleaned_text   159471 non-null  object
dtypes: int64(6), object(1)
memory usage: 8.5+ MB


In [9]:
data[data['obscene'] == 0]

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned_text
0,0,0,0,0,0,0,explanation edits made username hardcore metal...
1,0,0,0,0,0,0,daww matches background colour seemingly stuck...
2,0,0,0,0,0,0,hey man really trying edit war guy constantly ...
3,0,0,0,0,0,0,make real suggestions improvement wondered sec...
4,0,0,0,0,0,0,sir hero chance remember page
...,...,...,...,...,...,...,...
159566,0,0,0,0,0,0,second time asking view completely contradicts...
159567,0,0,0,0,0,0,ashamed horrible thing put talk page
159568,0,0,0,0,0,0,spitzer umm theres actual article prostitution...
159569,0,0,0,0,0,0,looks like actually put speedy first version d...


In [10]:
data[(data['obscene']==1) & (data['toxic'] == 1)].shape

(7926, 7)

In [11]:
data[(data['insult']==1) & (data['toxic'] == 0)].shape

(533, 7)

In [12]:
data[(data['identity_hate']==1) & (data['toxic'] == 0)].shape

(103, 7)

In [13]:
data[data['threat']==1].head()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned_text
79,1,0,0,1,0,0,hi back last warning stop undoing edits die
176,1,0,1,1,1,1,think fagget get oife burn hell hate sorry can...
600,1,0,0,1,0,0,also sock puppet account suprise sincerely man...
802,1,0,1,1,1,0,fuck smith please notified die want dance grave
1017,1,1,1,1,1,1,wouldnt first time bitch fuck ill find live so...


In [14]:
data['cleaned_text'] = data['cleaned_text'].apply(lambda x: str(x))

In [15]:
train, test = train_test_split(data, test_size=0.2)
train.head()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate,cleaned_text
96090,0,0,0,0,0,0,support thanks hard work checked every detail ...
63180,0,0,0,0,0,0,redirect talk celebrity big brother
34659,0,0,0,0,0,0,want company name history presented wikipedia
63397,0,0,0,0,0,0,thanks looks useful honest think time knowledg...
31191,1,0,0,1,1,1,white racist white girl makes think talk anoth...


## Tokenization
For LSTM and BiLSTM another tokenization and word vectorization is used which is different from the ones used for the regular models that were built.

In [16]:
MAX_SEQUENCE_LENGTH = 400
MAX_NB_WORDS = 50000

In [47]:
tokenizer=Tokenizer(lower=False, filters='',num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(train['cleaned_text'])

sequences = tokenizer.texts_to_sequences(train['cleaned_text'])
test_sequences = tokenizer.texts_to_sequences(test['cleaned_text'])

train_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

print('Shape of train data tensor:', train_data.shape)

test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
nb_words = (np.max(train_data) + 1)

Shape of train data tensor: (127656, 400)


In [18]:
import pickle
with open('tokenizer_lstm.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [19]:
print(nb_words)

50000


## LSTM Model

In [20]:
from keras.layers import LSTM

In [48]:
inp = Input(shape=(MAX_SEQUENCE_LENGTH, ))
# size of the vector space
embed_size = 128
x = Embedding(nb_words, embed_size)(inp)
output_dimention = 60
x = LSTM(output_dimention, return_sequences=True,name='lstm_layer')(x)
# reduce dimention
x = GlobalMaxPool1D()(x)
# disable 10% precent of the nodes
x = Dropout(0.1)(x)
# pass output through a RELU function
x = Dense(50, activation="relu")(x)
# another 10% dropout
x = Dropout(0.1)(x)
# pass the output through a sigmoid layer, since 
# we are looking for a binary (0,1) classification
x = Dense(6, activation="sigmoid")(x)

model = Model(inputs=inp, outputs=x)

model.summary()
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 400)]             0         
                                                                 
 embedding_1 (Embedding)     (None, 400, 128)          6400000   
                                                                 
 lstm_layer (LSTM)           (None, 400, 60)           45360     
                                                                 
 global_max_pooling1d_1 (Glo  (None, 60)               0         
 balMaxPooling1D)                                                
                                                                 
 dropout_2 (Dropout)         (None, 60)                0         
                                                                 
 dense_2 (Dense)             (None, 50)                3050      
                                                           

In [49]:
y = train[labels].values

In [50]:
y

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]], dtype=int64)

In [51]:
model.fit(train_data,y, batch_size=32, epochs=2, validation_split=0.1)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1a59eb9ba90>

In [52]:
from sklearn.metrics import accuracy_score, hamming_loss, log_loss
def evaluate_score(Y_test,predict): 
    accuracy = accuracy_score(Y_test,predict)
    print("Accuracy : {}".format(accuracy*100))
    print(f'Hamming Loss : {hamming_loss(Y_test, predict)}')
    try : 
        loss = log_loss(Y_test,predict)
    except :
        loss = log_loss(Y_test,predict.toarray())
    print("Log_loss : {}".format(loss))
    return accuracy,loss

In [53]:
pred = model.predict(test_data)
pred[:10]



array([[8.13038845e-04, 2.74493942e-07, 3.21314837e-05, 2.97820202e-06,
        1.21085184e-04, 9.38685389e-06],
       [9.93320286e-01, 1.88148811e-01, 9.61634099e-01, 2.57386416e-02,
        7.69616425e-01, 9.87762213e-02],
       [2.44177748e-02, 7.25165155e-05, 1.63619174e-03, 5.83332439e-04,
        3.94833507e-03, 6.91708352e-04],
       [1.25453772e-03, 1.18742480e-06, 5.81385975e-05, 1.25671186e-05,
        2.16187982e-04, 2.45843803e-05],
       [7.10039914e-01, 2.90427054e-03, 1.55623555e-01, 5.39087411e-03,
        2.83621013e-01, 1.80782434e-02],
       [5.13999909e-03, 6.88245655e-06, 2.41900023e-04, 7.88985708e-05,
        8.41708214e-04, 1.05906271e-04],
       [2.73356144e-03, 1.18653611e-06, 9.57825687e-05, 1.57736704e-05,
        3.62441147e-04, 3.21693769e-05],
       [9.85717177e-01, 2.25447029e-01, 9.35743093e-01, 5.58339618e-02,
        7.93467164e-01, 1.84385374e-01],
       [5.77867553e-02, 2.03859763e-05, 1.90499832e-03, 1.95712026e-04,
        5.56305982e-03, 

In [54]:
y_test = test[labels].values
y_test

array([[0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 1, 0],
       [0, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]], dtype=int64)

In [55]:
pred

array([[8.13038845e-04, 2.74493942e-07, 3.21314837e-05, 2.97820202e-06,
        1.21085184e-04, 9.38685389e-06],
       [9.93320286e-01, 1.88148811e-01, 9.61634099e-01, 2.57386416e-02,
        7.69616425e-01, 9.87762213e-02],
       [2.44177748e-02, 7.25165155e-05, 1.63619174e-03, 5.83332439e-04,
        3.94833507e-03, 6.91708352e-04],
       ...,
       [1.19458139e-02, 1.03232451e-05, 1.26814982e-03, 4.16561707e-05,
        1.96341448e-03, 2.02212526e-04],
       [6.16130070e-04, 2.26648822e-07, 2.09067566e-05, 2.60966021e-06,
        8.46088515e-05, 6.25734401e-06],
       [3.34688026e-04, 1.08087853e-07, 1.22838510e-05, 1.10674159e-06,
        4.93516673e-05, 3.49544007e-06]], dtype=float32)

In [56]:
lstm_acc, lstm_loss = evaluate_score(y_test, pred.round())

Accuracy : 92.01942660191132
Hamming Loss : 0.01732727557574808
Log_loss : 0.7293063023457708


In [58]:
model.save('LSTM_toxic_prediction_model3.h5')

## BiLSTM Model

In [63]:
inp = Input(shape=(MAX_SEQUENCE_LENGTH, ))
# size of the vector space
embed_size = 128
bi = Embedding(nb_words, embed_size)(inp)
output_dimention = 60
bi = Bidirectional(LSTM(output_dimention, return_sequences=True,name='lstm_layer'))(bi)
# reduce dimention
bi = GlobalMaxPool1D()(bi)
# disable 10% precent of the nodes
bi = Dropout(0.1)(bi)
# pass output through a RELU function
bi = Dense(50, activation="relu")(bi)
# another 10% dropout
bi = Dropout(0.1)(bi)
# pass the output through a sigmoid layer, since 
# we are looking for a binary (0,1) classification
bi = Dense(6, activation="sigmoid")(bi)

model_bi = Model(inputs=inp, outputs=bi)

model_bi.summary()
model_bi.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 400)]             0         
                                                                 
 embedding_2 (Embedding)     (None, 400, 128)          6400000   
                                                                 
 bidirectional (Bidirectiona  (None, 400, 120)         90720     
 l)                                                              
                                                                 
 global_max_pooling1d_2 (Glo  (None, 120)              0         
 balMaxPooling1D)                                                
                                                                 
 dropout_4 (Dropout)         (None, 120)               0         
                                                                 
 dense_4 (Dense)             (None, 50)                6050

In [64]:
model_bi.fit(train_data,y, batch_size=32, epochs=2, validation_split=0.1)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x1a5a3df3130>

In [65]:
y_pred_bi = model_bi.predict(test_data)



In [66]:
bilstm_acc, bilstm_loss = evaluate_score(y_test, y_pred_bi.round())

Accuracy : 91.82202725990913
Hamming Loss : 0.017755496370567652
Log_loss : 0.7425914205369538


In [67]:
model_bi.save('BiLSTM_toxic_prediction_model3.h5')

**Note:** We are storing our model for future use

In [81]:
'''##########[!] Functions to be used###############'''
def remove_contractions(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

def remove_punctuations(sentence): #function to clean the word of any punctuation or special characters
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]',r' ',cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n"," ")
    return cleaned

def clean_sentences(sentence):
    sentence = str(sentence)
    sentence= re.sub(r"http\S+", "", sentence)
    sentence = BeautifulSoup(sentence, 'lxml').get_text()
    sentence = remove_contractions(sentence)
    sentence = remove_punctuations(sentence)
    sentence = re.sub("\S*\d\S*", "", sentence).strip()
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)
    stop_words = set(stopwords.words('english'))
    stop_words.update(['zero','one','two','three','four','five','six','seven','eight','nine','ten','may','also','across','among','beside','however','yet','within'])
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in  stopwords.words('english'))
    return sentence.strip()

def tokenize(sentence):
    MAX_SEQUENCE_LENGTH = 400
    #MAX_NB_WORDS = 50000
    with open('tokenizer_lstm.pickle', 'rb') as handle:
                    tokenizer = pickle.load(handle)
    test_sequences = tokenizer.texts_to_sequences([sentence])
    test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
    return test_data

def model_predict(model, test_data):
    from keras.models import load_model
    if model == 'lstm':
        model=load_model('LSTM_toxic_prediction_model2.h5')
    elif model == 'bilstm':
        model = load_model('BiLSTM_toxic_prediction_model.h5')
    prediction=model.predict(test_data)    
    return prediction

def get_prediction(model, sentence):
    clear_text=clean_sentences(sentence)
    test_data=tokenize(clear_text)
    predicted_array=model_predict(model, test_data)
    #'identity_hate', 'insult', 'obscene', 'severe_toxic', 'threat', 'toxic'
    predicted_values={'Toxic':round(predicted_array[0][0]),'Severe Toxic':round(predicted_array[0][1]), 'Obscene':round(predicted_array[0][2]), 'Threat':round(predicted_array[0][3]), 'Insult':round(predicted_array[0][4]), 'Hatred':round(predicted_array[0][5])}
    #print(clear_text)
    #print(test_data)
    print(predicted_array[0])
    result_list=[]
    for key in predicted_values:
        #print(key)
        #print(predicted_values[key])
        if(predicted_values[key]==1.0):
            result_list.append(key)
    result = (',').join(result_list)
    if result == 'Toxic,Severe Toxic,Obscene,Insult':
        result_cat = 'Intense Malicious Disparagement'
        toxic_per = 90
    elif result == 'Toxic,Obscene,Insult,Hatred':
        result_cat = 'Venomous Reprehension'
        toxic_per = 80
    elif result == 'Toxic,Obscene,Insult':
        result_cat = 'Malicious Indecency'
        toxic_per = 70
    elif result == 'Toxic,Obscene':
        result_cat = 'Noxious'
        toxic_per = 60
    elif result == 'Toxic,Threat':
        result_cat = 'Menacing'
        toxic_per = 50
    elif result == 'Toxic,Insult':
        result_cat = 'Offensive'
        toxic_per = 50
    elif result == 'Toxic,Obscene,Threat,Insult,Hatred':
        result_cat = 'Malevolent Vulgarity'
        toxic_per = 95
    elif result == 'Toxic,Severe Toxic,Obscene':
        result_cat = 'Intense Contamination'
        toxic_per = 80
    elif result == 'Toxic,Obscene,Threat,Insult':
        result_cat = 'Dangerous Provocation'
        toxic_per = 85
    elif result == 'Toxic,Severe Toxic,Obscene,Insult,Hatred':
        result_cat = 'Excessive Malevolence'
        toxic_per = 90
    elif result == 'Toxic,Severe Toxic,Obscene,Threat,Insult,Hatred':
        result_cat = 'Overwhelming Hostility'
        toxic_per = 95
    elif result == 'Toxic,Insult,Hatred':
        result_cat = 'Hostile Disdain'
        toxic_per = 70
    elif result == 'Toxic,Hatred':
        result_cat = 'Virulent Hostility'
        toxic_per = 60
    elif result == 'Obscene,Insult':
        result_cat = 'Vulgar Reproach'
        toxic_per = 70
    elif result == 'Toxic,Severe Toxic,Obscene,Hatred':
        result_cat = 'Excessive Hostility'
        toxic_per = 85
    elif result == 'Toxic,Severe Toxic,Obscene,Threat,Insult':
        result_cat = 'Overwhelming Menace'
        toxic_per = 90
    elif result == 'Toxic,Obscene,Hatred':
        result_cat = 'Harmful Repugnance'
        toxic_per = 75
    elif result == 'Obscene,Insult,Hatred':
        result_cat = 'Indecent Contempt'
        toxic_per = 80
    elif result == 'Toxic,Severe Toxic':
        result_cat = 'Intense Toxicity'
        toxic_per = 60
    elif result == 'Toxic,Severe Toxic,Insult':
        result_cat = 'Severe Disdain'
        toxic_per = 70
    elif result == 'Toxic,Severe Toxic,Hatred':
        result_cat = 'Intense Malevolence'
        toxic_per = 80
    elif result == 'Obscene,Threat':
        result_cat = 'Offensive Threat'
        toxic_per = 50
    elif result == 'Toxic,Threat,Insult':
        result_cat = 'Hazardous Assault'
        toxic_per = 60
    elif result == 'Insult,Hatred':
        result_cat = 'Offensive Animosity'
        toxic_per = 60
    elif result == 'Toxic,Severe Toxic,Obscene,Threat':
        result_cat = 'Overwhelming Peril'
        toxic_per = 90
    elif result == 'Toxic,Severe Toxic,Insult,Hatred':
        result_cat = 'Intense Hostility'
        toxic_per = 90
    elif result == 'Toxic,Obscene,Threat':
        result_cat = 'Hazardous Provocation'
        toxic_per = 80
    elif result == 'Toxic,Severe Toxic,Threat':
        result_cat = 'Severe Menace'
        toxic_per = 85
    elif result == 'Threat,Insult':
        result_cat = 'Threatening Disdain'
        toxic_per = 50
    elif result == 'Toxic,Threat,Insult,Hatred':
        result_cat = 'Perilous Animosity'
        toxic_per = 75
    elif result == 'Toxic,Threat,Hatred':
        result_cat = 'Dangerous Hostility'
        toxic_per = 80
    elif result == 'Obscene,Threat,Insult':
        result_cat = 'Indecent Threat'
        toxic_per = 70
    elif result == 'Toxic,Severe Toxic,Threat,Insult':
        result_cat = 'Intense Disparagement'
        toxic_per = 85
    elif result == 'Toxic,Severe Toxic,Threat,Hatred':
        result_cat = 'Overwhelming Malevolence'
        toxic_per = 90
    elif result == 'Obscene,Hatred':
        result_cat = 'Indecent Animosity'
        toxic_per = 60
    elif result == 'General':
        result_cat = 'Harmless or Positive Commentary'
        toxic_per = 0
    elif result == 'Toxic':
        result_cat = 'Toxic'
        toxic_per = 40
    elif result == 'Severe Toxic':
        result_cat = 'Severe Toxic'
        toxic_per = 50
    elif result == 'Obscene':
        result_cat = 'Obscene'
        toxic_per = 50
    elif result == 'Threat':
        result_cat = 'Threat'
        toxic_per = 50
    elif result == 'Insult':
        result_cat = 'Insult'
        toxic_per = 50
    elif result == 'Hatred':
        result_cat = 'Hatred'
        toxic_per = 60
#     print(result)
#     print(result_cat)
#     print(toxic_per)
    return result, result_cat, toxic_per
    

In [78]:
','.join(['Toxic', 'Insult'])

'Toxic,Insult'

In [61]:
get_prediction('lstm', "COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK")

[0.99742466 0.28200847 0.9733534  0.03681181 0.8689683  0.15827034]
Toxic Obscene Insult 


In [45]:
get_prediction("Hello")




In [46]:
get_prediction("BAstard")

Hate Obscene Severe_Toxic 


In [62]:
get_prediction('lstm', 'Fuck OFF man , you peace of cunt. Mother fucker')

[0.99942267 0.51024634 0.9938762  0.04907466 0.93890274 0.25359315]
Toxic Severe_Toxic Obscene Insult 


--------------

Both the LSTM and the BiLSTM models are trained here, and both of them gave very similar results. And the predictions are very accurate. So, these models can be finalized for the deployment.