In [1]:
!pip install --upgrade transformers # only run this once per kernel session - don't want to overload kaggle

Collecting transformers
  Downloading transformers-3.5.1-py3-none-any.whl (1.3 MB)
Collecting tokenizers==0.9.3
  Downloading tokenizers-0.9.3-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB)
Collecting sentencepiece==0.1.91
  Downloading sentencepiece-0.1.91-cp37-cp37m-manylinux1_x86_64.whl (1.1 MB)
Installing collected packages: tokenizers, sentencepiece, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.9.2
    Uninstalling tokenizers-0.9.2:
      Successfully uninstalled tokenizers-0.9.2
  Attempting uninstall: sentencepiece
    Found existing installation: sentencepiece 0.1.94
    Uninstalling sentencepiece-0.1.94:
      Successfully uninstalled sentencepiece-0.1.94
  Attempting uninstall: transformers
    Found existing installation: transfo

Explanation of each library we use here:
* numpy - numerical array library, makes it easy for us to do array operations such as ``argmax``
* pandas - data processing library, used here to read the pickled data with ``pd.read_pickle`` and to write the predictions CSV
* transformers - huggingface's transformers NLP library, contains the BERT model + tokenizer that we will use
* pickle - Python's library for reading pickled files. Our data is pickled, so it has to be unpickled before we can use it
* tensorflow - the ML training library that we will use to fine-tune BERT
* scikit-learn - classical ML library, used here for the classification report
* re - Python's regex library, used in data cleaning
* nltk - natural language toolkit, used in data cleaning
* matplotlib - plotting library
* seaborn - a plotting library built on top of matplotlib, with nicer defaults

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import transformers as ppb # BERT Model
import pickle # decode pickled data
import tensorflow as tf
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import regexp_tokenize
import matplotlib.pyplot as plt
import seaborn as sns

First, load all the data into the program. We use the given training and testing set.

In [3]:
train_x = pd.read_pickle("../input/humor-detection/X_train.pickle")
train_y = pd.read_pickle("../input/humor-detection/y_train.pickle")
test_x = pd.read_pickle("../input/humor-detection/X_test.pickle")
test_y = pd.read_pickle("../input/humor-detection/y_test.pickle")

Below is the main model and tokenizer setup.

To tokenize our data, we use huggingface's BertTokenizer, which handles the BERT-specific steps on top of our own cleaning, such as adding special tokens like \[CLS\] and padding the sentences to a common length. We still have to do the same data cleaning that we did previously, so I copied those steps over into this notebook.

The model is TFBertForSequenceClassification, which is a sequence classifier (exactly what we want): BERT with a classification layer on top. This is more convenient than our previous models since the classification step and the fine-tuning step are packed into one model, making it easier for us to use and work with.
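
To make the tokenizer's extra steps concrete, here is a minimal NumPy sketch of what happens once a sentence has been split into wordpiece IDs: a \[CLS\] ID is prepended, a \[SEP\] ID is appended, and shorter sentences are padded so the batch is rectangular, with an attention mask marking the real tokens. The special-token IDs (0, 101, 102) are the ones in the ``bert-base-uncased`` vocabulary. The helper name ``add_special_tokens_and_pad`` and the example wordpiece IDs are made up for illustration, and the real tokenizer does considerably more (wordpiece splitting, truncation, token type IDs).

```python
import numpy as np

PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token IDs in bert-base-uncased

def add_special_tokens_and_pad(sequences):
    """Toy version of the tokenizer's post-processing (hypothetical helper)."""
    # wrap every sentence as [CLS] ... [SEP]
    wrapped = [[CLS_ID] + list(seq) + [SEP_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    # start from an all-padding batch, then copy each sentence into its row
    input_ids = np.full((len(wrapped), max_len), PAD_ID, dtype=np.int32)
    attention_mask = np.zeros((len(wrapped), max_len), dtype=np.int32)
    for row, seq in enumerate(wrapped):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Illustrative wordpiece IDs, assumed for the example (not looked up in the vocabulary)
ids, mask = add_special_tokens_and_pad([[2023, 2003, 6057], [7592]])
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the padding positions, which is why ``input_ids`` and ``attention_mask`` are passed to the model together during training.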

In [4]:
tokenizer = ppb.AutoTokenizer.from_pretrained("bert-base-uncased")
model = ppb.TFBertForSequenceClassification.from_pretrained("bert-base-uncased")

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier', 'dropout_37']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The data cleaning functions below perform the same steps as in the main data processing notebook:
* ``lemmatize()`` - lemmatizes the sentence input ``s``, helps simplify model vocabulary
* ``lower()`` - lowercases all the words in the sentence input ``s`` - is a remnant of a previous iteration, but I kept it around since it did no harm
* ``clean()`` - a generalized cleaning function that does the two steps above + removes all numbers from ``data``, the list of sentences passed in
* ``tokenize()`` - tokenizes the list of sentences passed in as ``text`` - this is what the ``BertTokenizer`` from the transformers library does. Returns the padded token IDs and attention masks as TensorFlow tensors.
* ``process()`` - a combination of cleaning and tokenizing, a function really created for our ease of use

In [5]:
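wordnet_lemmatizer = WordNetLemmatizer() # build once and reuse for every sentence

def lemmatize(s):
    # lemmatize each word as a verb
    return " ".join([wordnet_lemmatizer.lemmatize(w, 'v') for w in s.split(" ")])

def lower(s):
    return s.lower()

def clean(data):
    # strings are immutable, so each step's result has to be kept and collected
    cleaned = []
    for item in data:
        item = lemmatize(item)
        item = lower(item)
        item = re.sub(r'\d+', '', item) # remove nums
        cleaned.append(item)
    return cleaned

def tokenize(text):
    # pad to the longest sentence, truncate to the model's max length, return TF tensors
    return tokenizer(text, padding=True, truncation=True, return_tensors="tf")

def process(data):
    cleaned = clean(data)
    return tokenize(cleaned) # tokenize the cleaned sentences, not the raw input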

In [6]:
train_batch = process(train_x)
test_batch = process(test_x)

Below we set up the TensorFlow model that we're going to use to classify and fine-tune BERT.

In [7]:
import os
learning_rate = 2e-6
epochs = 10
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric1 = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
# callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=os.path.dirname(save_path), save_weights_only=True, monitor='val_loss', mode='min', save_best_only=True)]
model.compile(optimizer=optimizer, loss=loss, metrics=[metric1])
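
For intuition about the loss and metric chosen above, here is a small NumPy sketch of what sparse categorical cross-entropy with ``from_logits=True`` and sparse categorical accuracy compute for a two-class problem. The function names and the example logits are illustrative assumptions, and Keras's own implementation differs in details such as clipping and reduction options.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(labels, logits):
    """Mean of -log(softmax(logits)[true class]) over the batch."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # subtract the row maximum before exponentiating for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # "sparse" means the labels are integer class indices, not one-hot vectors
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def sparse_categorical_accuracy(labels, logits):
    """Fraction of rows whose largest logit sits at the true class index."""
    predicted = np.argmax(np.asarray(logits), axis=1)
    return float((predicted == np.asarray(labels)).mean())

# Made-up logits for three sentences and two classes
example_logits = [[2.0, -1.0], [0.3, 0.1], [-0.5, 1.5]]
example_labels = [0, 1, 1]

loss_value = sparse_categorical_crossentropy_from_logits(example_labels, example_logits)
acc_value = sparse_categorical_accuracy(example_labels, example_logits)
print(loss_value, acc_value)
```

Because the model outputs raw logits rather than probabilities, ``from_logits=True`` tells the loss to apply the softmax itself.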

In [8]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


In [9]:
history = model.fit(x=[train_batch.input_ids, train_batch.attention_mask], 
                    y=np.array(train_y), 
                    validation_data=([test_batch.input_ids, test_batch.attention_mask], np.array(test_y)), 
                    epochs=epochs,
                    batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [10]:
model.evaluate(x=[test_batch.input_ids,test_batch.attention_mask], y=np.array(test_y))



[0.054196543991565704, 0.9862891435623169]

In [11]:
from sklearn.metrics import classification_report
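# pass the attention mask too, so padding tokens are ignored just like in fit() and evaluate()
y_pred = model.predict(x=[test_batch.input_ids, test_batch.attention_mask])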

In [12]:
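print(y_pred[0].shape) # logits: one row per test sentence, one column per class
y_pred_bool = np.argmax(y_pred[0], axis=1) # predicted class = index of the largest logit
print(np.array(test_y).shape)
print(y_pred_bool)
print(y_pred[0])
print(classification_report(test_y, y_pred_bool))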

(3355, 2)
(3355,)
[0 0 0 ... 0 0 0]
[[ 0.18150629 -0.46780977]
 [-0.02558374 -0.27848935]
 [-0.13714966 -0.17368306]
 ...
 [ 0.44547603 -0.6948679 ]
 [ 0.15049738 -0.51288646]
 [ 0.11995767 -0.4173474 ]]
              precision    recall  f1-score   support

           0       0.85      0.93      0.89      2304
           1       0.81      0.64      0.72      1051

    accuracy                           0.84      3355
   macro avg       0.83      0.79      0.80      3355
weighted avg       0.84      0.84      0.83      3355
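
The ``np.argmax`` step can be checked by hand on the three logit rows printed in the output above. The sketch below copies those rows and applies a softmax to show the class probabilities they imply. The ``softmax`` helper is written here for illustration and is not part of the notebook's pipeline.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# First three logit rows copied from the printed output
logits = np.array([
    [ 0.18150629, -0.46780977],
    [-0.02558374, -0.27848935],
    [-0.13714966, -0.17368306],
])

probs = softmax(logits)
pred = np.argmax(logits, axis=1)  # softmax is monotonic, so argmax of the logits is enough
print(pred)
print(probs.round(3))
```

All three rows pick class 0, which matches the start of the printed ``y_pred_bool``. The third row is a very close call, with the two class probabilities both near 0.5.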



In [13]:
def predict(joke, model):
    test_jokes = [
    "Don’t you hate people who use big words just to make themselves look perspicacious?",
    "My friend thinks he is smart. He told me an onion is the only food that makes you cry, so I threw a coconut at his face.",
    "So far, consumers haven’t returned to the sort of panic buying frenzy that sent packaged-food manufacturers scrambling earlier this year.",
    "Wall Street Week Ahead: Stock investors cast wary eye on yield rally",
    "What happens to a frog's car when it breaks down? It gets toad away."
    ]
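    test_jokes.append(joke)
    tk = tokenizer(test_jokes, padding=True, truncation=True, return_tensors="tf")
    # pass the attention mask so the padding tokens are ignored, as in training
    out = model.predict(x=[tk.input_ids, tk.attention_mask])
    boolean_pred = np.argmax(out[0], axis=1)
    return boolean_pred[-1] == 1 # the joke we appended is the last row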

In [14]:
predict("I didn t climb to the top of the food chain to be a vegetarian...", model)

True

In [15]:
submission = pd.DataFrame({"Prediction":y_pred_bool})
submission.to_csv("predictions.csv", index=True, index_label="Id")
model.save_weights("./model/model_checkpoint.ckpt")

In [16]:
model_new = ppb.TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
model_new.load_weights("./model/model_checkpoint.ckpt")
model_new.compile(optimizer=optimizer, loss=loss, metrics=[metric1])
model_new.evaluate(x=[test_batch.input_ids,test_batch.attention_mask], y=np.array(test_y))

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertForSequenceClassification: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['dropout_75', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




[0.054196543991565704, 0.9862891435623169]

In [17]:
predict("The Michigan Data Science Team is amazing!", model_new)

True