# Sarcasm Detection: BERT

## Author: Elsa Scola Martín
### Objective:
Using the dataset [News Headlines Dataset For Sarcasm Detection](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection) created by [Rishabh Misra and Prahal Arora](https://arxiv.org/abs/1908.07414), the goal of this notebook is to illustrate the implementation of BERT for sarcasm detection.

### What is done in the Notebook: 
- Load the required dependencies
- Define helper functions
- Load BERT from the TensorFlow Hub
- Load CSV files containing data
- Load tokenizer
- Text encoding
- Build model
- Save the best model and early stopping
- Fit the model
- Evaluate model results with test data
- Extract False Positives and False Negatives

### Load the required dependencies

In [0]:
# We will use the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

In [2]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/98/2c/8df20f3ac6c22ac224fff307ebc102818206c53fc454ecd37d8ac2060df5/sentencepiece-0.1.86-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.86


In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub
from sklearn import model_selection
from sklearn import metrics
import keras

from tensorflow.keras.callbacks import EarlyStopping

import tokenization

Using TensorFlow backend.


### Define Helper Functions
[Source](https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub) of helper functions.

In [0]:
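def bert_encode(texts, tokenizer, max_len=160):
    # Encode each text into the three fixed-length inputs BERT expects:
    # token ids, attention mask and segment ids.
    all_tokens = []
    all_masks = []
    all_segments = []

    for text in texts:
        text = tokenizer.tokenize(text)

        # Truncate so that [CLS] and [SEP] still fit within max_len
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)

        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len  # pad with id 0 ([PAD])
        pad_masks = [1] * len(input_sequence) + [0] * pad_len  # 1 = real token, 0 = padding
        segment_ids = [0] * max_len  # single-sentence input: one segment

        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)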
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
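
To make the three arrays returned by the encoding helper concrete, here is a minimal sketch of the same padding and masking logic applied to a single, already tokenized headline. The vocabulary `toy_vocab` and the function `encode_one` are hypothetical and exist only for this illustration; the real token ids come from the BERT vocabulary file.

```python
# Hypothetical toy vocabulary; the real ids come from the BERT vocab file.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "local": 3, "man": 4, "wins": 5}

def encode_one(tokens, max_len):
    """Mirror the per-text steps of the encoding helper for one token list."""
    tokens = tokens[:max_len - 2]                 # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = [toy_vocab[t] for t in sequence] + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len    # 1 = real token, 0 = padding
    segments = [0] * max_len                      # single segment
    return ids, mask, segments

ids, mask, segments = encode_one(["local", "man", "wins"], max_len=8)
print(ids)       # [1, 3, 4, 5, 2, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

All three lists have length `max_len`, which is why the helper can stack them into rectangular NumPy arrays.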

In [0]:
def build_model(bert_layer, max_len=160):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

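    # The hub layer returns (pooled_output, sequence_output); only the per-token sequence output is used
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # Take the hidden state of the [CLS] token (position 0) as the sentence representation
    clf_output = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(clf_output)

    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(
        Adam(learning_rate=2e-6),
        loss='binary_crossentropy',
        metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.TruePositives()]
    )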
    
    return model

### Load BERT from the TensorFlow Hub

In [6]:
%%time
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

CPU times: user 20.9 s, sys: 4.22 s, total: 25.1 s
Wall time: 34.7 s


In [7]:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Load CSV files containing data

In [0]:
train = pd.read_csv("/content/drive/My Drive/TFMColab/train.csv")
val = pd.read_csv("/content/drive/My Drive/TFMColab/val.csv")
test = pd.read_csv("/content/drive/My Drive/TFMColab/test.csv")

### Load tokenizer

In [0]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

### Text encoding

In [0]:
train_input = bert_encode(train.headline.values, tokenizer, max_len = 160)
test_input = bert_encode(test.headline.values, tokenizer, max_len = 160)
val_input = bert_encode(val.headline.values, tokenizer, max_len = 160)

train_labels = train.is_sarcastic.values
test_labels = test.is_sarcastic.values
val_labels = val.is_sarcastic.values

### Build model

In [12]:
model = build_model(bert_layer, max_len = 160)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_word_ids (InputLayer)     [(None, 160)]        0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 160)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 160)]        0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 1024), (None 335141889   input_word_ids[0][0]             
                                                                 input_mask[0][0]             

### Save the best model and early stopping

To prevent the model from overfitting, early stopping has been enabled.

Early stopping lets us specify an arbitrarily large number of training epochs and stop training once the model's performance stops improving on a held-out validation dataset.
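
The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation; the function name `early_stopping_epoch` and the validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses, for illustration only.
losses = [0.40, 0.30, 0.31, 0.32, 0.33, 0.20]
print(early_stopping_epoch(losses))  # 5
```

In this toy run the best loss is reached at epoch 2, three epochs pass without improvement, and training stops at epoch 5 even though a later epoch would have been better. That is the trade-off a small patience value makes.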


In [13]:
# Save the model only when the monitored validation accuracy improves.
saveBestModel = ModelCheckpoint('best_model.hdf5', monitor='val_accuracy', verbose=0, save_best_only=True, save_weights_only=False, mode='auto')
# Stop training when the validation loss has not improved for 3 consecutive epochs.
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='auto')





### Fit the model


In [15]:
train_history = model.fit(
    train_input, train_labels,
    validation_data=(val_input, val_labels),
    epochs=10,
    batch_size=20,
    callbacks=[saveBestModel, earlyStopping]
)

#model.save('model.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10

### Evaluate model results with test data

Predictions for the test set are obtained with the 'predict' function and rounded to binary labels (threshold 0.5).

In [0]:
test_pred = model.predict(test_input)
test_pred = test_pred.round().astype(int)

In [0]:
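recall = metrics.recall_score(test_labels, test_pred)
precision = metrics.precision_score(test_labels, test_pred)
f1_score = metrics.f1_score(test_labels, test_pred)
accuracy = metrics.accuracy_score(test_labels, test_pred)
# Note: test_pred holds rounded 0/1 labels, not probabilities, so this log loss
# mostly reflects the error rate and is not comparable to the training loss
loss = metrics.log_loss(test_labels, test_pred)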


In [33]:
print('Loss:',loss)
print('Accuracy:',accuracy)
print('Precision:',precision)
print('Recall:',recall)
print('f1 score:',f1_score)

Loss: 2.9443963599750527
Accuracy: 0.9147517979301877
Precision: 0.9244851258581236
Recall: 0.8938053097345132
f1 score: 0.9088863892013498


In [34]:
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
# kappa
kappa = cohen_kappa_score(test_labels,test_pred)
print('Cohens kappa: %f' % kappa)
# ROC AUC (computed on rounded 0/1 labels here, so it equals the balanced accuracy)
auc = roc_auc_score(test_labels,test_pred)
print('ROC AUC: %f' % auc)
# confusion matrix
matrix = confusion_matrix(test_labels,test_pred)
print(matrix)

Cohens kappa: 0.828837
ROC AUC: 0.913781
[[2791  198]
 [ 288 2424]]
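
As a sanity check, the reported scores can be re-derived from the confusion matrix alone. The sketch below uses only the four counts printed above. The log loss line assumes the clipping constant 1e-15 that older scikit-learn versions applied by default; that is an assumption about the library version, although it is consistent with the reported value.

```python
import math

# Confusion matrix reported above: rows = true class, columns = predicted class.
tn, fp = 2791, 198
fn, tp = 288, 2424
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions, ROC AUC reduces to the mean of recall and specificity.
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2

# Assumption: each misclassified example is clipped to a probability of about 1e-15,
# so the log loss is roughly error_rate * -log(1e-15).
approx_log_loss = (fp + fn) / total * -math.log(1e-15)

print(f"accuracy          {accuracy:.6f}")
print(f"precision         {precision:.6f}")
print(f"recall            {recall:.6f}")
print(f"f1                {f1:.6f}")
print(f"balanced accuracy {balanced_accuracy:.6f}")
print(f"approx. log loss  {approx_log_loss:.2f}")
```

The balanced accuracy matches the reported ROC AUC, and the approximate log loss matches the reported loss. Both figures are therefore functions of the hard predictions only; scoring them from the unrounded sigmoid outputs would give a more informative picture.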


### Extract False Positives and False Negatives

False positives and false negatives are stored in separate CSV files for later analysis.

In [0]:
def getFP_FN_lists(test_X, test_y, pred_y):
    # Collect the text and original index of every misclassified test example
    FP_text = []
    FP_index = []
    FN_text = []
    FN_index = []
    for i in range(len(test_y)):
        idx = test_y.index[i]
        if pred_y[i] == 1 and test_y[idx] == 0:
            # Predicted sarcastic, actually not sarcastic
            FP_text.append(test_X[idx])
            FP_index.append(idx)
        elif pred_y[i] == 0 and test_y[idx] == 1:
            # Predicted not sarcastic, actually sarcastic
            FN_text.append(test_X[idx])
            FN_index.append(idx)

    return FP_text, FP_index, FN_text, FN_index
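
The selection rule can be checked on toy data without pandas. The following sketch is a plain-Python variant, not the notebook's function; `split_errors` and the headlines are hypothetical and serve only to show which rows end up in each group.

```python
def split_errors(texts, true_labels, pred_labels):
    """Return (false_positives, false_negatives) as lists of (position, text)."""
    false_pos, false_neg = [], []
    rows = zip(texts, true_labels, pred_labels)
    for pos, (text, y_true, y_pred) in enumerate(rows):
        if y_pred == 1 and y_true == 0:
            false_pos.append((pos, text))    # predicted sarcastic, actually not
        elif y_pred == 0 and y_true == 1:
            false_neg.append((pos, text))    # predicted not sarcastic, actually sarcastic
    return false_pos, false_neg

# Made-up headlines and labels, for illustration only.
texts = ["headline a", "headline b", "headline c", "headline d"]
true_labels = [0, 1, 0, 1]
pred_labels = [1, 1, 0, 0]

fp_rows, fn_rows = split_errors(texts, true_labels, pred_labels)
print(fp_rows)  # [(0, 'headline a')]
print(fn_rows)  # [(3, 'headline d')]
```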

In [0]:
def getFP_FN(test_X, test_y, pred_y):
    '''Returns 2 dataframes, one with all the False Positives and one with all the False Negatives'''
    FP_text,FP_index,FN_text,FN_index = getFP_FN_lists(test_X, test_y, pred_y)
    d_FP = {'FP_text':FP_text,'FP_index':FP_index}
    df_FP = pd.DataFrame(d_FP)
    d_FN = {'FN_text':FN_text,'FN_index':FN_index}
    df_FN = pd.DataFrame(d_FN)
    
    return df_FP,df_FN

In [0]:
# We get the FPs and FNs as DataFrames and store them to CSVs
df_FP,df_FN = getFP_FN(test['headline'], test['is_sarcastic'],test_pred)
df_FP.to_csv('bert_FP.csv', index=True)
df_FN.to_csv('bert_FN.csv', index=True)