<a href="https://colab.research.google.com/github/Eashan123/BERT/blob/master/multilabel_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Why multilabel classification:
multi-label classification assumes that a document can simultaneously and independently assigned to multiple labels or classes. Multi-label classification has many real world applications such as categorising businesses or assigning multiple genres to a movie. In the world of customer service, this technique can be used to identify multiple intents for a customer’s email. <br>
****Why is it a better alternative than RNN?**** <br>
BERT is a multilingual transformer based model that has achieved state-of-the-art results on various NLP tasks. BERT is a bidirectional model that is based on the transformer architecture, it replaces the sequential nature of RNN (LSTM & GRU) with a much faster Attention-based approach. The model is also pre-trained on two unsupervised tasks, masked language modeling and next sentence prediction.

In [0]:
# #https://github.com/strongio/keras-bert/blob/master/keras-bert.ipynb

In [0]:
# !pip install bert

In [0]:
# !pip install keras-bert

In [0]:
!pip install bert-tensorflow

Collecting bert-tensorflow
[?25l  Downloading https://files.pythonhosted.org/packages/a6/66/7eb4e8b6ea35b7cc54c322c816f976167a43019750279a8473d355800a93/bert_tensorflow-1.0.1-py2.py3-none-any.whl (67kB)
[K     |████▉                           | 10kB 27.3MB/s eta 0:00:01[K     |█████████▊                      | 20kB 31.7MB/s eta 0:00:01[K     |██████████████▋                 | 30kB 37.9MB/s eta 0:00:01[K     |███████████████████▍            | 40kB 40.2MB/s eta 0:00:01[K     |████████████████████████▎       | 51kB 41.3MB/s eta 0:00:01[K     |█████████████████████████████▏  | 61kB 43.3MB/s eta 0:00:01[K     |████████████████████████████████| 71kB 10.7MB/s 
Installing collected packages: bert-tensorflow
Successfully installed bert-tensorflow-1.0.1


In [0]:
!pip install -U -q PyDrive

In [0]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [0]:
#We are mounting drive
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
# Importing Packages

import tensorflow as tf
import pandas as pd
import tensorflow_hub as hub
import os
import re
import numpy as np
from bert.tokenization import FullTokenizer
from tqdm import tqdm_notebook
from tensorflow.keras import backend as K
# from tensorflow.keras.callbacks import callback

In [0]:
# import tqdm.keras as tk
# # from tqdm.keras import TqdmCallback
# from keras.models import model_from_yaml
from tensorflow.keras.models import load_model
from bert import run_classifier
from bert import optimization
from bert import tokenization




In [0]:
# Initialize session
sess = tf.Session()
bert_path = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1" #lower case model, the smallest in size.

In [0]:
# Params for bert model and tokenization
# Variable configuration

max_seq_length = 100
# EMBEDDING_DIM = 50 # could be 100/150/200
VALIDATION_SPLIT = 0.15
BATCH_SIZE = 200
EPOCHS = 1 #to save time
possible_labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

In [0]:
DATA_COLUMN = 'text'
LABEL_COLUMN = 'labels'

In [0]:
link = 'https://drive.google.com/open?id=1a9V_WyDXpgTg_nHyI5dRNIOI3SSQrScP' 

In [0]:
link_unknown_data = 'https://drive.google.com/open?id=1qjkuI_3uOBKNyowTMv3a5Xf-aJWbOldz'

In [0]:
fluff, id = link.split('=')

In [0]:
fluff_p, id_p = link_unknown_data.split('=')

In [0]:
print (id) # Verify that you have everything after '='

1qjkuI_3uOBKNyowTMv3a5Xf-aJWbOldz


In [0]:
downloaded = drive.CreateFile({'id':id}) 

In [0]:
downloaded_p = drive.CreateFile({'id':id_p}) 

In [0]:
downloaded.GetContentFile('Filename.csv')  
train_df = pd.read_csv('Filename.csv')

In [0]:
downloaded_p.GetContentFile('Filename.xlsx')  
p_dat = pd.read_excel('Filename.xlsx')

### Data
First, we load the sample data Statement Toxicity data

In [0]:
# train_df = pd.read_csv('/BERT/data/train.csv')
trn_df = train_df.sample(frac=0.01,random_state = 42) 
# trn_df = train_df
trn_df.shape

(1596, 8)

In [0]:
 trn_df.head(3)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
119105,7ca72b5b9c688e9e,"Geez, are you forgetful! We've already discus...",0,0,0,0,0,0
131631,c03f72fd8f8bf54f,Carioca RFA \n\nThanks for your support on my ...,0,0,0,0,0,0
125326,9e5b8e8fc1ff2e84,"""\n\n Birthday \n\nNo worries, It's what I do ...",0,0,0,0,0,0


#### Adding a new column as per the requirement of bert

In [0]:
trn_df['labels'] = list(zip(trn_df.toxic.tolist(), trn_df.severe_toxic.tolist(), trn_df.obscene.tolist(), trn_df.threat.tolist(),  trn_df.insult.tolist(), trn_df.identity_hate.tolist()))

In [0]:
# Some text processing

In [0]:
trn_df['text'] = trn_df['comment_text'].apply(lambda x: x.replace('\n', ' '))

In [0]:
trn_df.head(3)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,labels,text
119105,7ca72b5b9c688e9e,"Geez, are you forgetful! We've already discus...",0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0)","Geez, are you forgetful! We've already discus..."
131631,c03f72fd8f8bf54f,Carioca RFA \n\nThanks for your support on my ...,0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0)",Carioca RFA Thanks for your support on my re...
125326,9e5b8e8fc1ff2e84,"""\n\n Birthday \n\nNo worries, It's what I do ...",0,0,0,0,0,0,"(0, 0, 0, 0, 0, 0)",""" Birthday No worries, It's what I do ;)En..."


#### Converting data to array for the input to the model

In [0]:
train_text = trn_df['text'].tolist()
train_text = [' '.join(t.split()[0:max_seq_length]) for t in train_text]
train_text = np.array(train_text, dtype=object)[:, np.newaxis]

In [0]:
train_label = trn_df['labels'].values

In [0]:
print('train_text shape -> {} \ntrain_text[0] -> {} \ntrain label shape -> {} \ntrain label -> {} \n'.format(train_text.shape, train_text[0], train_label.shape, train_label[0]))
# print('test_text shape -> {} \ntest_text[0] -> {} \ntrain label shape -> {} \n'.format(test_text.shape, test_text[0], test_label.shape))

train_text shape -> (1596, 1) 
train_text[0] -> ["Geez, are you forgetful! We've already discussed why Marx was not an anarchist, i.e. he wanted to use a State to mold his 'socialist man.' Ergo, he is a statist - the opposite of an anarchist. I know a guy who says that, when he gets old and his teeth fall out, he'll quit eating meat. Would you call him a vegetarian?"] 
train label shape -> (1596,) 
train label -> (0, 0, 0, 0, 0, 0) 



### Tokenize

***What are we doing here?*** <br>
Because BERT is a pretrained model that expects input data in a specific format, we will need: <br>
a. special tokens to mark the beginning ([CLS]) and separation/end of sentences ([SEP])  -> refer convert_text_to_examples(), InputExample(object) <br>
b. tokens that conforms with the fixed vocabulary used in BERT -> Convert Single Example() <br>
c. token IDs from BERT’s tokenizer -> create_tokenizer_from_hub_module() <br>
d. mask IDs to indicate which elements in the sequence are tokens and which are padding elements -> convert_examples_to_features(), convert_single_example() <br>
e. segment IDs used to distinguish different sentences -> convert_examples_to_features(), convert_single_example() <br>
f. positional embeddings used to show token position with in the sequence (input_ids) -> convert_examples_to_features(), convert_single_example() <br>

***Tokenizer example for a better understanding***

In [0]:
# #Demo

# text = "Here is the sentence I want embeddings for."
# marked_text = "[CLS] " + text + " [SEP]"

# # Tokenize our sentence with the BERT tokenizer.
# tokenized_text = tokenizer.tokenize(marked_text)

# # Print out the tokens.
# print (tokenized_text)

**After breaking the text into tokens, we then have to convert the sentence from a list of strings to a list of vocabulary indeces.**

In [0]:
# # Define a new example sentence with multiple meanings of the word "bank"
# text = "He went to the prison cell with his cell phone to extract blood cell samples from inmates"

# # Add the special tokens.
# marked_text = "[CLS] " + text + " [SEP]"

# # Split the sentence into tokens.
# tokenized_text = tokenizer.tokenize(marked_text)

# # Map the token strings to their vocabulary indeces.
# indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# # Display the words with their indeces.
# for tup in zip(tokenized_text, indexed_tokens):
# #     print('{:<12} {:>6,}'.format(tup[0], tup[1]))
#     print('text -> {:<12} input_ids -> {:>6,}'.format(tup[0], tup[1]))

****Segment ID****
BERT expects 1s and 0s to distinguish between the two sentences. That is, for each token in “tokenized_text,” we must specify which sentence it belongs to: sentence 0 (a series of 0s) or sentence 1 (a series of 1s). <br>

For example, single-sentence inputs only require a series of 1s, so we will create a vector of 1s for each token in our input sentence.

In [0]:
#  #Mark each of the 22 tokens as belonging to sentence "1".
# segments_ids = [1] * len(tokenized_text)

# print (segments_ids)

In [0]:
def create_tokenizer_from_hub_module():
    """Get the vocab file and casing info from the Hub module."""
    bert_module =  hub.Module(bert_path)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    vocab_file, do_lower_case = sess.run(
        [
            tokenization_info["vocab_file"],
            tokenization_info["do_lower_case"],
        ]
    )

    return FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)


In [0]:
class PaddingInputExample(object):
    """Fake example so the num input examples is a multiple of the batch size.
  When running eval/predict on the TPU, we need to pad the number of examples
  to be a multiple of the batch size, because the TPU requires a fixed batch
  size. The alternative is to drop the last batch, which is bad because it means
  the entire output data won't be generated.
  We use this class instead of `None` because treating `None` as padding
  battches could cause silent errors.
  """

class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.
    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only must be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train and dev examples, but not for test examples.
    """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


def convert_single_example(tokenizer, example, max_seq_length=100):
    """Converts a single `InputExample` into a single `InputFeatures`."""

    if isinstance(example, PaddingInputExample):
        input_ids = [0] * max_seq_length
        input_mask = [0] * max_seq_length
        segment_ids = [0] * max_seq_length
        label = 0
        return input_ids, input_mask, segment_ids, label

    tokens_a = tokenizer.tokenize(example.text_a)
    if len(tokens_a) > max_seq_length - 2:
        tokens_a = tokens_a[0 : (max_seq_length - 2)]

    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    return input_ids, input_mask, segment_ids, example.label

def convert_examples_to_features(tokenizer, examples, max_seq_length=100): ### calls convert single example
    """Convert a set of `InputExample`s to a list of `InputFeatures`."""

    input_ids, input_masks, segment_ids, labels = [], [], [], []
    for example in tqdm_notebook(examples, desc="Converting examples to features"):
        input_id, input_mask, segment_id, label = convert_single_example(
            tokenizer, example, max_seq_length
        )
        input_ids.append(input_id)
        input_masks.append(input_mask)
        segment_ids.append(segment_id)
        labels.append(label)
    return (
        np.array(input_ids),
        np.array(input_masks),
        np.array(segment_ids),
        np.array(labels),
#         np.array(labels).reshape(-1, 1),
        
    )

def convert_text_to_examples(texts, labels):
    """Create InputExamples"""
    InputExamples = []
    for text, label in zip(texts, labels):
        InputExamples.append(
            InputExample(guid=None, text_a=" ".join(text), text_b=None, label=label)
        )
    return InputExamples

***Instantiate tokenizer***

In [0]:
tokenizer = create_tokenizer_from_hub_module()

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


INFO:tensorflow:Saver not created because there are no variables in the graph to restore








***Convert our data to Bert format***

In [0]:
train_examples = convert_text_to_examples(train_text, train_label)

In [0]:
type(train_examples), train_examples[0]

(list, <__main__.InputExample at 0x7fe80ed847f0>)

In [0]:
# Convert to features
(train_input_ids, train_input_masks, train_segment_ids, train_labels 
) = convert_examples_to_features(tokenizer, train_examples, max_seq_length=max_seq_length)


HBox(children=(IntProgress(value=0, description='Converting examples to features', max=1596, style=ProgressSty…




In [0]:
type(train_input_ids), train_input_ids.shape, train_input_masks.shape, train_segment_ids.shape, train_labels.shape

(numpy.ndarray, (1596, 100), (1596, 100), (1596, 100), (1596, 6))

#### Model

In [0]:
class BertLayer(tf.keras.layers.Layer):
    def __init__(
        self,
        n_fine_tune_layers=10,
        pooling="first",
        bert_path="https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1", #This is the model we choose
        **kwargs,
    ):
        self.n_fine_tune_layers = n_fine_tune_layers
        self.trainable = True
        self.output_size = 768
        self.pooling = pooling
        self.bert_path = bert_path
        if self.pooling not in ["first", "mean"]:
            raise NameError(
                f"Undefined pooling type (must be either first or mean, but is {self.pooling}"
            )

        super(BertLayer, self).__init__(**kwargs)

    def get_config(self):

        config = super().get_config().copy()
        config.update({
            'n_fine_tune_layers': self.n_fine_tune_layers,
            'trainable': self.trainable,
            'output_size': self.output_size,
            'pooling': self.pooling,
            'bert_path': self.bert_path,
        })
        return config
    

    def build(self, input_shape):
        self.bert = hub.Module(
            self.bert_path, trainable=self.trainable, name=f"{self.name}_module"
        )

        # Remove unused layers
        trainable_vars = self.bert.variables
        if self.pooling == "first":
            trainable_vars = [var for var in trainable_vars if not "/cls/" in var.name]
            trainable_layers = ["pooler/dense"]

        elif self.pooling == "mean":
            trainable_vars = [
                var
                for var in trainable_vars
                if not "/cls/" in var.name and not "/pooler/" in var.name
            ]
            trainable_layers = []
        else:
            raise NameError(
                f"Undefined pooling type (must be either first or mean, but is {self.pooling}"
            )

        # Select how many layers to fine tune
        for i in range(self.n_fine_tune_layers):
            trainable_layers.append(f"encoder/layer_{str(11 - i)}")

        # Update trainable vars to contain only the specified layers
        trainable_vars = [
            var
            for var in trainable_vars
            if any([l in var.name for l in trainable_layers])
        ]

        # Add to trainable weights
        for var in trainable_vars:
            self._trainable_weights.append(var)

        for var in self.bert.variables:
            if var not in self._trainable_weights:
                self._non_trainable_weights.append(var)

        super(BertLayer, self).build(input_shape)

    def call(self, inputs):
        inputs = [K.cast(x, dtype="int32") for x in inputs]
        input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(
            input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids
        )
        if self.pooling == "first":
            pooled = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
                "pooled_output"
            ]
        elif self.pooling == "mean":
            result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)[
                "sequence_output"
            ]

            mul_mask = lambda x, m: x * tf.expand_dims(m, axis=-1)
            masked_reduce_mean = lambda x, m: tf.reduce_sum(mul_mask(x, m), axis=1) / (
                    tf.reduce_sum(m, axis=1, keepdims=True) + 1e-10)
            input_mask = tf.cast(input_mask, tf.float32)
            pooled = masked_reduce_mean(result, input_mask)
        else:
            raise NameError(f"Undefined pooling type (must be either first or mean, but is {self.pooling}")

        return pooled

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_size)

****Defining our model****

In [0]:
# Build model
def build_model(max_seq_length): 
    in_id = tf.keras.layers.Input(shape=(max_seq_length,), name="input_ids")
    in_mask = tf.keras.layers.Input(shape=(max_seq_length,), name="input_masks")
    in_segment = tf.keras.layers.Input(shape=(max_seq_length,), name="segment_ids")
    
    #This is the input in list form to be fed to the model
    bert_inputs = [in_id, in_mask, in_segment]
    
    bert_output = BertLayer(n_fine_tune_layers=3, pooling="first")(bert_inputs) #calling the preloaded BERT model we have installed
    
    dense = tf.keras.layers.Dense(256, activation='relu')(bert_output) # Attaching our model output here
    pred = tf.keras.layers.Dense(len(possible_labels), activation='sigmoid')(dense)
    
    model = tf.keras.models.Model(inputs=bert_inputs, outputs=pred)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    
    return model

In [0]:
def initialize_vars(sess):
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    K.set_session(sess)

In [0]:
modll_ = build_model(max_seq_length)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Instructions for updating:
If using Keras pass *_constraint arguments to layers.


Instructions for updating:
If using Keras pass *_constraint arguments to layers.


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 100)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 100)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 100)]        0                                            
__________________________________________________________________________________________________
bert_layer (BertLayer)          (None, 768)          110104890   input_ids[0][0]                  
                                                                 input_masks[0][0]            

In [0]:
# Instantiate variables
initialize_vars(sess)

In [0]:
train_input_ids.shape, train_input_masks.shape, train_segment_ids.shape, train_labels.shape 

((1596, 100), (1596, 100), (1596, 100), (1596, 6))

In [0]:
#Defining NBatchLogger for logging details for training
class NBatchLogger(tf.keras.callbacks.Callback):
    def __init__(self, display):
        self.seen = 0
        self.display = display

    def on_batch_end(self, batch, logs={}):
        self.seen += logs.get('size', 0)
        if self.seen % self.display == 0:
            metrics_log = ''
            for k in self.params['metrics']:
                if k in logs:
                    val = logs[k]
                    if abs(val) > 1e-3:
                        metrics_log += ' - %s: %.4f' % (k, val)
                    else:
                        metrics_log += ' - %s: %.4e' % (k, val)
            print('{}/{} ... {}'.format(self.seen,
                                        self.params['samples'],
                                        metrics_log))

****Transfer Learning****

In [0]:
out_batch = NBatchLogger(display=1000)

In [0]:
modll_.fit([train_input_ids, train_input_masks, train_segment_ids], train_labels, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_split=VALIDATION_SPLIT, callbacks=[out_batch])

#### Saving Model

Not working, needs some changes in it's original run config file, due to the different architecture followed by bert.

In [0]:
# # #refer: --> https://stackoverflow.com/questions/58678836/notimplementederror-layers-with-arguments-in-init-must-override-get-conf

modll_.save('/content/drive/My Drive/BertModel.h5')

# # #or

# # # serialize model to YAML
# model_yaml = modl_.to_yaml()
# with open("model.yaml", "w") as yaml_file:
#     yaml_file.write(model_yaml)
# # serialize weights to HDF5
# modl_.save_weights("bert_model_keras.h5")
# print("Saved model to disk") 

In [0]:
# # predictions before we clear and reload model
pre_save_preds = modll_.predict([test_input_ids[0:100], 
                                test_input_masks[0:100], 
                                test_segment_ids[0:100]]
                              ) 

In [0]:
# # A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
# listed = drive.ListFile({'q': "title contains '.h5' and 'root' in parents"}).GetList()
# for file in listed:
#     print('title {}, id {}'.format(file['title'], file['id']))

In [0]:
# Clear and load model
model = None
model = build_model(max_seq_length)
# initialize_vars(sess)
# model.load_weights('BertModel.h5')
model.load_weights('/content/drive/My Drive/BertModel.h5')

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


INFO:tensorflow:Saver not created because there are no variables in the graph to restore


Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 100)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 100)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 100)]        0                                            
__________________________________________________________________________________________________
bert_layer_1 (BertLayer)        (None, 768)          110104890   input_ids[0][0]                  
                                                                 input_masks[0][0]          

In [0]:
post_save_preds = model.predict([test_input_ids[0:100], 
                                test_input_masks[0:100], 
                                test_segment_ids[0:100]])

In [0]:
# type(post_save_preds), post_save_preds.shape

***Displaying Predicted Results***

In [0]:
pred_dict = {}
list_ = []
# "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"
for i in post_save_preds:
    pred_dict['toxic'] = str((i[0]*100)) + ' %'
    pred_dict['severe_toxic'] = str(i[1]*100) + ' %'
    pred_dict['obscene'] = str(i[2]*100) + ' %'
    pred_dict['threat'] = str(i[3]*100) + ' %'
    pred_dict['insult'] = str(i[4]*100) + ' %'
    pred_dict['identity_hate'] = str(i[5]*100) + ' %'
    list_.append(pred_dict)

In [0]:
cnt = 0
for i in range(len(list_)):
    cnt += 1
    if cnt < 5:
        print(list_[i], "\n")
    else:
        break

In [0]:
np.all(pre_save_preds == post_save_preds) # Are they the same?

#### Prediction on unknown data

In [0]:
p_dat = pd.read_excel(r'C:/Users/A716717/Eashan_practice/study/practice/data/val_mini.xlsx')
p_dat.head()

***Data Preprocessing***

In [0]:
p_dat['labels'] = list(zip(p_dat.toxic.tolist(), p_dat.severe_toxic.tolist(), p_dat.obscene.tolist(), p_dat.threat.tolist(), p_dat.insult.tolist(), p_dat.identity_hate.tolist()))

In [0]:
p_text = p_dat['comment_text'].tolist()

In [0]:
p_text

[":If you have a look back at the source, the information I updated was the correct form. I can only guess the source hadn't updated. I shall update the information once again but thank you for your message.",
 "I don't anonymously edit articles at all.",
 'Thank you for understanding. I think very highly of you and would not revert without discussion.',
 'Please do not add nonsense to Wikipedia. Such edits are considered vandalism and quickly undone. If you would like to experiment, please use the sandbox instead. Thank you.   -',
 ':Dear god this site is horrible.']

In [0]:
p_text = [' '.join(t.split()[0:max_seq_length]) for t in p_text]

In [0]:
p_text = np.array(p_text, dtype=object)[:, np.newaxis]

In [0]:
p_label = p_dat['labels'].values

In [0]:
#To do
p_examples = convert_text_to_examples(p_text, p_label)

In [0]:
(p_input_ids, p_input_masks, p_segment_ids, train_labels 
) = convert_examples_to_features(tokenizer, p_examples, max_seq_length=max_seq_length)

HBox(children=(IntProgress(value=0, description='Converting examples to features', max=5, style=ProgressStyle(…




#### Prediction

In [0]:
p_preds = model.predict([p_input_ids, 
                                p_input_masks, 
                                p_segment_ids])

In [0]:
pred_dict = {}
list_p = []
# "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"
for i in p_preds:
    pred_dict['toxic'] = str((i[0]*100)) + ' %'
    pred_dict['severe_toxic'] = str(i[1]*100) + ' %'
    pred_dict['obscene'] = str(i[2]*100) + ' %'
    pred_dict['threat'] = str(i[3]*100) + ' %'
    pred_dict['insult'] = str(i[4]*100) + ' %'
    pred_dict['identity_hate'] = str(i[5]*100) + ' %'
    list_p.append(pred_dict)

In [0]:
for i in range(len(list_p)):
    print(list_p[i], "\n")

{'toxic': '9.895411133766174 %', 'severe_toxic': '0.05812942981719971 %', 'obscene': '0.2048194408416748 %', 'threat': '0.08640885353088379 %', 'insult': '0.5824130028486252 %', 'identity_hate': '0.07198824314400554 %'} 

{'toxic': '9.895411133766174 %', 'severe_toxic': '0.05812942981719971 %', 'obscene': '0.2048194408416748 %', 'threat': '0.08640885353088379 %', 'insult': '0.5824130028486252 %', 'identity_hate': '0.07198824314400554 %'} 

{'toxic': '9.895411133766174 %', 'severe_toxic': '0.05812942981719971 %', 'obscene': '0.2048194408416748 %', 'threat': '0.08640885353088379 %', 'insult': '0.5824130028486252 %', 'identity_hate': '0.07198824314400554 %'} 

{'toxic': '9.895411133766174 %', 'severe_toxic': '0.05812942981719971 %', 'obscene': '0.2048194408416748 %', 'threat': '0.08640885353088379 %', 'insult': '0.5824130028486252 %', 'identity_hate': '0.07198824314400554 %'} 

{'toxic': '9.895411133766174 %', 'severe_toxic': '0.05812942981719971 %', 'obscene': '0.2048194408416748 %', 'th