# About this kernel

[Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4) is a model created and made publicly available by Google. Built in TensorFlow, it was trained to embed sentences or short paragraphs so that as much of their meaning as possible is preserved, which also makes it suitable for fine-tuning on classification tasks.

This implementation is extremely pleasant to use: the input is simply the string, and the output is just the 512-dimensional encoded sentence; no preprocessing is needed. It is also fully trainable, and uses an almost state-of-the-art architecture, namely pre-BERT transformers (nice [comparison in this blog post](https://blog.floydhub.com/when-the-best-nlp-model-is-not-the-best-choice/)).

Since this competition is dedicated to helping beginners get started, I feel this is a perfect example of using a novel technology packaged in a gentle and correctly abstracted API (as opposed to the horrors of [1000 lines of TensorFlow code](https://github.com/google-research/bert/blob/master/modeling.py) that need to be understood before modifying BERT). Here, instead, **I'm only showing you some 50 lines of code to get everything up and running**, and only ~15 lines to set up the model!

## Summary

This kernel serves as a short and straightforward introduction to the process of:
1. Loading a trained model from [TensorFlow Hub](https://tfhub.dev/).
2. Building a Keras model by using the trained model as a layer.
3. Training the newly created Keras model, and performing inference.

In [None]:
# Tensorflow
# !pip uninstall -y tf-hub-nightly
# !pip uninstall -y tensorflow-hub
# !pip uninstall -y tensorflow
# !pip install tensorflow==2.0.0
# !pip install tensorflow_hub==0.7.0

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
# from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Input, InputLayer, BatchNormalization, Dropout, Concatenate, Layer
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras import regularizers

!pip install bert-for-tf2
import bert

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import average_precision_score, auc, classification_report, confusion_matrix, roc_curve, precision_recall_curve

from nltk.corpus import stopwords
from nltk.util import ngrams

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import nltk
from collections import defaultdict
from collections import  Counter
import re
import gensim
import string
from tqdm import tqdm, tqdm_notebook


In [None]:
# For finding Tensorflow version
print("TF version: ", tf.__version__)
print("Hub version: ", hub.__version__)

# Load data and model

First, load all the CSV files we will need.

In [None]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

print(type(train))

Create convenient names for the variables we will be using for training and inference.

In [None]:
# train_data = train.text.values
# train_labels = train.target.values
# test_data = test.text.values
# train_data.shape


### Clean Data

In [None]:
!pip install contractions
!pip install beautifulsoup4

import contractions
from bs4 import BeautifulSoup
import unicodedata
import re

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    [s.extract() for s in soup(['iframe', 'script'])]
    stripped_text = soup.get_text()
    # collapse runs of line breaks; inside [...] a '|' is a literal pipe, not an "or"
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def expand_contractions(text):
    return contractions.fix(text)

def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

def pre_process_document(document):
    # strip HTML
    document = strip_html_tags(document)
    # lower case
    document = document.lower()
    # remove extra newlines (often might be present in really noisy text)
    document = document.translate(document.maketrans("\n\t\r", "   "))
    # remove accented characters
    document = remove_accented_chars(document)
    # expand contractions    
    document = expand_contractions(document)  
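    # insert spaces around selected punctuation so the words it joins stay separate,
    # then remove special characters and digits
    # (the hyphen is escaped; an unescaped "(-)" is a range that matches only "(" and ")")
    special_char_pattern = re.compile(r'([{.()\-!}])')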
    document = special_char_pattern.sub(" \\1 ", document)
    document = remove_special_characters(document, remove_digits=True)  
    # remove extra whitespace
    document = re.sub(' +', ' ', document)
    document = document.strip()
    
    return document


pre_process_corpus = np.vectorize(pre_process_document)
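
To make the effect of these cleaning steps concrete, here is a minimal sketch that uses only the standard library and applies a reduced version of the pipeline to one made-up tweet. It deliberately leaves out the HTML stripping, the contraction expansion and the punctuation-spacing step, so it is not a drop-in replacement for `pre_process_document`; the name `simple_clean` and the sample text are illustrative assumptions, not part of the kernel.

```python
import re
import unicodedata


def simple_clean(text):
    """Reduced stand-in for pre_process_document (no HTML or contraction handling)."""
    # lower-case and turn newlines/tabs into spaces
    text = text.lower().translate(str.maketrans("\n\t\r", "   "))
    # decompose accented characters, then drop everything that is not ASCII
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # keep only letters and whitespace (same pattern as remove_digits=True)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # collapse repeated spaces and trim the ends
    return re.sub(" +", " ", text).strip()


# Made-up example tweet; "\u00e9" is an accented "e"
sample = "Forest fire near La Ronge, Sask.\nCanada! #wildfire caf\u00e9 2020"
cleaned = simple_clean(sample)
print(cleaned)
```

Punctuation, the hashtag symbol and the digits disappear, while the accented character is reduced to its plain ASCII letter.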

In [None]:
# %%time

# pd.options.display.max_colwidth = 200

# print(train[train['target']==1]['text'].values[:10])

# print(train[train['target']==0]['text'].values[:10])

# train_data = pre_process_corpus(train.text)
# print("T Shape:" + str(train_data.shape))
# # train_data = pd.DataFrame(train_data)
# train_labels = train.target.values
# test_data = pre_process_corpus(test.text)

# X_train,X_test,y_train,y_test=train_test_split(train_data,train_labels,test_size=0.20, random_state = 777, shuffle = True)

## Load BERT Module

In [None]:
%%time

bert_path = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1"
max_seq_length = 256
# bert_embedding_module = hub.Module(bert_path, trainable=False, name='bert_embedding_module_2')

bert_embedding_layer = hub.KerasLayer(bert_path, trainable=False, name='bert_embedding_layer')
# hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
#                             trainable=True)

In [None]:
def get_masks(tokens, max_seq_length):
    """Mask for padding"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    return [1]*len(tokens) + [0] * (max_seq_length - len(tokens))


def get_segments(tokens, max_seq_length):
    """Segments: 0 for the first sequence, 1 for the second"""
    if len(tokens)>max_seq_length:
        raise IndexError("Token length more than max seq length!")
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (max_seq_length - len(tokens))


def get_ids(tokens, tokenizer, max_seq_length):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (max_seq_length-len(token_ids))
    return input_ids

# vocab_file = bert_embedding_layer.resolved_object.vocab_file.asset_path.numpy()
# do_lower_case = bert_embedding_layer.resolved_object.do_lower_case.numpy()
FullTokenizer = bert.bert_tokenization.FullTokenizer
# tokenizer = FullTokenizer(vocab_file, do_lower_case)

vocab_file = bert_embedding_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_embedding_layer.resolved_object.do_lower_case.numpy()
tokenizer = FullTokenizer(vocab_file, do_lower_case)

# tokenizer.tokenize("sadas asdasd   asdasd").shape
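
As a quick sanity check on the three helpers above, the following self-contained sketch re-declares them in simplified form (without the length checks) and runs them on a hand-written token list. `ToyTokenizer` and its five-entry vocabulary are made up for illustration; the real `FullTokenizer` reads the vocabulary file shipped with the BERT module.

```python
def get_masks(tokens, max_seq_length):
    """1 for real tokens, 0 for padding."""
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))


def get_segments(tokens, max_seq_length):
    """0 up to and including the first [SEP], 1 afterwards, 0 for padding."""
    segments, current = [], 0
    for token in tokens:
        segments.append(current)
        if token == "[SEP]":
            current = 1
    return segments + [0] * (max_seq_length - len(tokens))


class ToyTokenizer:
    """Hypothetical stand-in for FullTokenizer with a tiny vocabulary."""
    vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "forest": 3, "fire": 4}

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]


def get_ids(tokens, tokenizer, max_seq_length):
    """Vocabulary ids, right-padded with 0 up to max_seq_length."""
    ids = tokenizer.convert_tokens_to_ids(tokens)
    return ids + [0] * (max_seq_length - len(ids))


tokens = ["[CLS]", "forest", "fire", "[SEP]"]
max_len = 6
ids = get_ids(tokens, ToyTokenizer(), max_len)
masks = get_masks(tokens, max_len)
segments = get_segments(tokens, max_len)
# A sentence pair, to show when segment id 1 appears
pair_segments = get_segments(["[CLS]", "forest", "[SEP]", "fire", "[SEP]"], max_len)
print(ids, masks, segments, pair_segments)
```

For single-sentence inputs such as tweets, the segment ids come out as all zeros, so only the ids and the mask carry information.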

In [None]:
a = [1,3]
print(type(a))
# print(a.shape)

np.array(a, dtype=object)[:, np.newaxis]

In [None]:
def convert_sentences_to_features(tokenizer, sentences, max_seq_length=256, is_test = False):
    """Tokenize the `text` column and build padded BERT inputs (ids, masks, segment ids)."""
    
    stokens = sentences["text"].apply(lambda x : ["[CLS]"] + tokenizer.tokenize(x) + ["[SEP]"] )
    # Stack the per-sentence lists into 2-D arrays of shape (n_sentences, max_seq_length);
    # plain arrays are used here because np.matrix is discouraged in current NumPy
    stokens_ids = np.array(stokens.apply(lambda x : get_ids(x, tokenizer, max_seq_length)).tolist())
    stokens_masks = np.array(stokens.apply(lambda x : get_masks(x, max_seq_length)).tolist())
    stokens_segments = np.array(stokens.apply(lambda x : get_segments(x, max_seq_length)).tolist())

    assert stokens_ids[0].size == max_seq_length
    assert stokens_masks[0].size == max_seq_length
    assert stokens_segments[0].size == max_seq_length

    if is_test:
        return (
            stokens_ids,
            stokens_masks,
            stokens_segments,
        )
    else:
        return (
            stokens_ids,
            stokens_masks,
            stokens_segments,
            np.array(sentences["target"]).reshape(-1,1),
        )

# print(train)
(all_input_ids, all_input_masks, all_segment_ids, all_labels 
) = convert_sentences_to_features(tokenizer, train, max_seq_length=max_seq_length)
(test_input_ids, test_input_masks, test_segment_ids
) = convert_sentences_to_features(tokenizer, test, max_seq_length=max_seq_length, is_test = True)


In [None]:
print(type(all_input_ids))

# print(all_input_ids.values[0:2])

print(all_input_ids.shape)
print(all_input_masks.shape)
print(all_segment_ids.shape)
print(all_labels.shape)
print(test_input_ids.shape)
print(test_input_masks.shape)
print(test_segment_ids.shape)

### LOAD BERT MODEL

The BERT encoder was already loaded from tfhub.dev in the "Load BERT Module" cell above (make sure Internet is enabled for that step!).

### Customize BERT Model

# Train the model

Build a small Keras model with the functional API, in just a few lines. Note that the three `Input`s here are integer tensors (token ids, padding masks and segment ids) rather than a tf.string: unlike the USE, which tokenizes strings internally, the BERT layer you just loaded expects text that has already been tokenized and converted to integers, which is what `convert_sentences_to_features` did above. The layer then maps those integers to embeddings, and we put a small `Dense` head on top of its pooled output.

If none of those words make any sense, worry not! You don't have to get into the low-level implementation details, and can focus on using the pre-trained layer as a tool in your Keras model.

In [None]:
with tf.device('/gpu:0'):
    def build_model(bert_embedding_layer, max_seq_length):
#         input_word_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
#                                        name="input_word_ids")
#         input_mask = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
#                                            name="input_mask")
#         segment_ids = tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32,
#                                             name="segment_ids")
        in_id = Input(shape=(max_seq_length,), dtype=tf.int32, name="input_ids")
        in_mask = Input(shape=(max_seq_length,), dtype=tf.int32, name="input_masks")
        in_segment = Input(shape=(max_seq_length,), dtype=tf.int32, name="segment_ids")
        bert_inputs = [in_id, in_mask, in_segment]
        
#         bert_output = BertLayer(n_fine_tune_layers=3, pooling="first")(bert_inputs)
        bert_pooled_output, bert_sequenced_output = bert_embedding_layer(bert_inputs)
        x = Dense(10, activation='relu')(bert_pooled_output) 
#         x = Dense(64, activation='relu')(bert_pooled_output) 
#         x = Dense(32, activation='relu')(x)
        output = Dense(1, activation='sigmoid')(x)
        
        model = Model(inputs=bert_inputs, outputs=output)
        
        # `learning_rate` is the TF2 name of this argument; `lr` is a deprecated alias
        optimizer = Adam(learning_rate=0.0001)

        metrics = [
            'accuracy', 
            tf.keras.metrics.Recall(),
            tf.keras.metrics.Precision()
        ]
        model.compile(optimizer, loss='binary_crossentropy', metrics=metrics)

        return model

Let's check if the model looks the way we want:

In [None]:
model = build_model( bert_embedding_layer, max_seq_length = max_seq_length)
model.summary()

Let's get started with the training step! We'll use 20% of the data to validate the results, and only save the model that has the lowest loss on that 20% of the data.

In [None]:

print("Shape of all inputs ",all_input_ids.shape)
print("Shape of all labels ",all_labels.shape)

# print("Shape of all inputs ",np.array(all_input_ids).reshape(all_input_ids.shape[0],).shape)

# train_input_ids, train_input_masks, train_segment_ids, train_labels 
train_input_ids, dev_input_ids, train_input_masks, dev_input_masks, train_segment_ids, dev_segment_ids, train_labels, dev_labels =train_test_split(
    np.array(all_input_ids),
    np.array(all_input_masks),
    np.array(all_segment_ids),
    np.array(all_labels),
    test_size=0.20, 
    random_state = 777, 
    shuffle = True)

print("Shape of Train ",train_input_ids.shape)
print("Shape of Dev ",dev_input_ids.shape)

# save_best_only=True keeps only the checkpoint with the lowest validation loss
checkpoint = ModelCheckpoint('model_with_low_val_loss.h5', monitor='val_loss', mode='min', save_best_only=True)
early_stopping = EarlyStopping(monitor='val_loss', mode='min', min_delta=0.001, patience=10)
callbacks = [early_stopping, checkpoint]

train_history = model.fit(
    [train_input_ids, train_input_masks, train_segment_ids], 
    train_labels,
    validation_data=(
        [dev_input_ids, dev_input_masks, dev_segment_ids],
        dev_labels
    ),
    epochs=50,
    callbacks=callbacks,
    batch_size=32
)
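
The interplay of the two callbacks can be hard to picture, so here is a plain-Python sketch that mimics their decision rules on a made-up list of validation losses. It is a simplified model of the behavior, not the Keras implementation: the function name `simulate_callbacks` and the loss values are hypothetical, and it assumes the usual rules (checkpoint whenever `val_loss` strictly improves; stop once `patience` epochs pass without an improvement larger than `min_delta`).

```python
def simulate_callbacks(val_losses, min_delta=0.001, patience=10):
    """Return (epoch of the best checkpoint, epoch at which training stops), 1-indexed."""
    best_ckpt_loss = float("inf")   # tracked by the checkpoint rule
    best_stop_loss = float("inf")   # tracked by the early-stopping rule
    best_epoch, wait = None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        # checkpoint rule: any strict improvement overwrites the saved model
        if loss < best_ckpt_loss:
            best_ckpt_loss, best_epoch = loss, epoch
        # early-stopping rule: only improvements larger than min_delta reset the counter
        if loss < best_stop_loss - min_delta:
            best_stop_loss, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Hypothetical validation losses, one per epoch
losses = [0.60, 0.50, 0.45, 0.46, 0.4495, 0.47, 0.48]
best_epoch, stop_epoch = simulate_callbacks(losses, min_delta=0.001, patience=3)
print(best_epoch, stop_epoch)
```

In this toy run the fifth epoch improves the loss by less than `min_delta`: it is still saved as the best checkpoint, but it does not reset the early-stopping counter, which is why the saved model and the last trained epoch can differ.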

In [None]:
# One row per epoch, one column per metric returned by model.fit
model_loss = pd.DataFrame(train_history.history)
# model_loss.head()
model_loss[['loss','val_loss']].plot(ylim=[0,1])
model_loss[['accuracy','val_accuracy']].plot(ylim=[0,1])
model_loss.filter(regex="precision", axis=1).plot(ylim=[0,1])
model_loss.filter(regex="recall", axis=1).plot(ylim=[0,1])
# predictions = model.predict_classes(X_test) 
# print(classification_report(y_test, predictions, target_names=["Real", "Not Real"]))

# Load Best Model

## Calculate F1 Score

In [None]:
def findClasses(predictions, threshold=0.5):
    """Turn predicted probabilities into 0/1 class labels."""
    return [1 if p >= threshold else 0 for p in np.ravel(predictions)]

# Restore the checkpoint with the lowest validation loss before evaluating
model.load_weights('model_with_low_val_loss.h5')
predictions = model.predict([dev_input_ids, dev_input_masks, dev_segment_ids]) 
classes = findClasses(predictions)
# target_names follow the sorted labels [0, 1]: 0 = not a real disaster, 1 = real disaster
print(classification_report(dev_labels, classes, target_names=["Not Real", "Real"]))
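
The F1 score in that report combines precision and recall for each class. As a rough illustration of the arithmetic, the sketch below thresholds a few made-up probabilities at 0.5 and computes precision, recall and F1 for the positive class by hand. The helper name `f1_from_probs` and the values in `y_true` and `probs` are invented for this example.

```python
def f1_from_probs(y_true, probs, threshold=0.5):
    """Precision, recall and F1 for the positive class (label 1)."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # guard against empty denominators when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented labels and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0, 1]
probs = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1, 0.3]
precision, recall, f1 = f1_from_probs(y_true, probs)
print(precision, recall, f1)
```

Here two of the three predicted positives are correct and two of the four actual positives are found, so F1 lands between the two values, closer to the lower one.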

# Inference

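Don't forget that the latest model might not be the best! Instead, the best is the one we saved as `model_with_low_val_loss.h5`; its weights were loaded above, so let's run prediction on the test inputs.

In [None]:
# The BERT layer expects the same three inputs as during training
test_pred = model.predict([
    np.array(test_input_ids),
    np.array(test_input_masks),
    np.array(test_segment_ids),
])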

Finally, we threshold the predictions at 0.5 to get integer labels, update the `submission` dataframe, and save it as CSV... Oof!

In [None]:
pred =  pd.DataFrame(test_pred, columns=['preds'])
pred.plot.hist()

In [None]:
# Convert the predicted probabilities to 0/1 labels (threshold 0.5)
submission['target'] = findClasses(test_pred)
submission.to_csv('submission.csv', index=False)