<a href="https://colab.research.google.com/github/StephanHav/github-slideshow/blob/master/sBERT_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Thesis Nationale Politielab AI - STS (sBERT) and Text Classification

## 06-05-2021

# Part 1

## Fine-tuning DistilBERT for a clickbait classification task

First, we import a dataset of article titles, each labeled as clickbait or not (1 = clickbait, 0 = not clickbait).

In [13]:
import pandas as pd

DATASET_URL = 'https://gist.githubusercontent.com/amitness/0a2ddbcb61c34eab04bad5a17fd8c86b/raw/66ad13dfac4bd1201e09726677dd8ba8048bb8af/clickbait.csv'
data = pd.read_csv(DATASET_URL)
data.head(5)

Unnamed: 0,title,label
0,"15 Highly Important Questions About Adulthood,...",1
1,250 Nuns Just Cycled All The Way From Kathmand...,1
2,"Australian comedians ""could have been shot"" du...",0
3,Lycos launches screensaver to increase spammer...,0
4,Fußball-Bundesliga 2008–09: Goalkeeper Butt si...,0


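Create a train/validation/test split (80/10/10): the training set is used to fit the model, the validation set to monitor early stopping and select the best weights, and the test set for the final evaluation.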

In [None]:
from sklearn.model_selection import train_test_split

X = list(data.title.values) # the texts --> X
y = list(data.label.values) # the labels we want to predict --> Y
labels = ['not clickbait', 'clickbait']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=1)

In [None]:
len(data)

31986
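
As a sanity check on the split sizes, the sketch below reproduces the arithmetic for the 31,986 titles. It assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample, which is consistent with the 3,199 test examples reported in the classification report of this notebook. `split_sizes` is a hypothetical helper written for this note, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, held_out_fraction):
    """Return (kept, held_out) sizes, rounding the held-out part up (assumption)."""
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

# First split: 80% train, 20% held out
n_train, n_rest = split_sizes(31986, 0.2)
# Second split: the held-out part is halved into test and validation
n_test, n_val = split_sizes(n_rest, 0.5)

print(n_train, n_test, n_val)  # 25588 3199 3199
```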

Load the tokenizer, tokenize the three splits, and convert the encodings to TensorFlow datasets.

In [None]:
!pip install transformers

import tensorflow as tf
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=128) # convert input strings to BERT encodings
test_encodings = tokenizer(X_test, truncation=True, padding=True,  max_length=128)
val_encodings = tokenizer(X_val, truncation=True, padding=True, max_length=128)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
)).shuffle(1000).batch(16) # convert the encodings to Tensorflow objects
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    y_val
)).batch(64)
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
)).batch(64)

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1
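
To make the effect of `truncation=True, padding=True, max_length=128` concrete, here is a minimal pure-Python sketch of what those options do to a batch of token IDs. The token IDs, the small `max_length`, the padding ID of 0, and the `pad_and_truncate` helper are all assumptions made for illustration. The real tokenizer also adds special tokens and keeps them when truncating, which this sketch ignores.

```python
def pad_and_truncate(token_id_lists, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[7, 8, 9, 10, 11, 12], [5, 6], [3, 4, 5]]  # made-up token IDs
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])       # [[7, 8, 9, 10], [5, 6, 0, 0], [3, 4, 5, 0]]
print(encoded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0], [1, 1, 1, 0]]
```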



In [None]:
from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', 
                                                           num_labels=len(labels))
# Stop once val_loss has not improved for 3 epochs and restore the best weights
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=0,
                                     mode='min', baseline=None,
                                     restore_best_weights=True)]

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer=optimizer, loss=loss)

Some layers from the model checkpoint at distilbert-base-cased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'vocab_projector', 'vocab_transform', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use it fo
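
The loss configured above, `SparseCategoricalCrossentropy(from_logits=True)`, takes raw logits and integer class labels. A minimal pure-Python sketch of the per-example value follows, assuming the standard definition (the negative log of the softmax probability of the true class). `sparse_ce_from_logits` is a name chosen for this note, and the logits are invented.

```python
import math

def sparse_ce_from_logits(logits, true_class):
    """Cross-entropy for one example: log-sum-exp of the logits minus the true-class logit."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_class]

# Undecided model: both classes equally likely, loss equals log(2)
print(round(sparse_ce_from_logits([0.0, 0.0], 0), 4))  # 0.6931

# Confident model: small loss when right, large loss when wrong
confident_right = sparse_ce_from_logits([-3.0, 3.0], 1)
confident_wrong = sparse_ce_from_logits([-3.0, 3.0], 0)
print(confident_right < confident_wrong)  # True
```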

In [None]:
# The datasets are already batched, so no batch_size argument is passed to fit()
model.fit(train_dataset,
          epochs=10,
          callbacks=callbacks,
          validation_data=val_dataset)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10


<tensorflow.python.keras.callbacks.History at 0x7fe4c72ce190>
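
Training ended after epoch 6 of 10, which is what one would expect from the `EarlyStopping` callback with `patience=3`. The sketch below mimics that rule in plain Python, assuming the callback counts consecutive epochs without a new best `val_loss`. The validation losses are invented for illustration (the notebook output does not show them) and are chosen so that the best epoch is 3 and the run ends at epoch 6.

```python
def run_with_early_stopping(val_losses, patience):
    """Return (last_epoch_run, best_epoch) under a patience-based stopping rule."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses for 10 planned epochs
hypothetical_val_losses = [0.050, 0.040, 0.030, 0.035, 0.040, 0.045, 0.050, 0.055, 0.060, 0.065]
stopped_at, best_epoch = run_with_early_stopping(hypothetical_val_losses, patience=3)
print(stopped_at, best_epoch)  # 6 3
```

With `restore_best_weights=True`, the weights kept after stopping would be those of the best epoch rather than the last one.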

In [None]:
import numpy as np
from sklearn.metrics import classification_report

outputs = model.predict(test_dataset)    # the first element holds the logits
y_preds = np.argmax(outputs[0], axis=1)  # predicted class = index of the largest logit
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1586
           1       0.99      1.00      0.99      1613

    accuracy                           0.99      3199
   macro avg       0.99      0.99      0.99      3199
weighted avg       0.99      0.99      0.99      3199
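
For reference, the per-class numbers in the report follow from simple counts of true positives, false positives, and false negatives. Below is a small sketch on made-up labels (not the notebook's predictions); `precision_recall_f1` is a hypothetical helper written for this note.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [1, 1, 1, 0, 0, 0]  # made-up ground truth
toy_pred = [1, 1, 0, 0, 0, 1]  # made-up predictions: one miss, one false alarm
precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print("{:.2f} {:.2f} {:.2f}".format(precision, recall, f1))  # 0.67 0.67 0.67
```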



Predicting unlabeled examples

In [None]:
new_examples = ["14 things you never knew about the police", 
               "National Police AI Lab: a cooperation between universities and law enforcement"]
examples_encodings = tokenizer(new_examples, truncation=True, padding=True)
examples_encodings = tf.data.Dataset.from_tensor_slices((
                    dict(examples_encodings)
                      )).batch(64)
pred_logits = model.predict(examples_encodings)

for i, logits in enumerate(pred_logits[0]):
    prediction = np.argmax(logits)
    print("{}: {}".format(new_examples[i], labels[prediction]))

14 things you never knew about the police: clickbait
National Police AI Lab: a cooperation between universities and law enforcement: not clickbait


Getting the probabilities from the logits through a softmax

In [None]:
softmax = lambda x: np.exp(x) / np.sum(np.exp(x))
for i, logits in enumerate(pred_logits[0]):
    proba = softmax(logits)
    probability_not_clickbait = proba[0]
    probability_clickbait = proba[1]
    # Probabilities are fractions in [0, 1]; scale by 100 to print percentages
    print("{}: {:.1f}% not clickbait; {:.1f}% clickbait".format(new_examples[i],
                                                                100 * probability_not_clickbait,
                                                                100 * probability_clickbait))

14 things you never knew about the police: 43.4% not clickbait; 56.6% clickbait
National Police AI Lab: a cooperation between universities and law enforcement: 100.0% not clickbait; 0.0% clickbait
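
A caveat on the softmax above: exponentiating large logits directly can overflow. A common remedy is to subtract the largest logit before exponentiating, which leaves the result unchanged. The sketch below shows this in plain Python; the logits are invented for illustration and `stable_softmax` is a name chosen for this note.

```python
import math

def stable_softmax(logits):
    """Softmax that subtracts the maximum logit first to avoid overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(stable_softmax([0.0, 0.0]))  # [0.5, 0.5]

# math.exp(1001.0) alone would raise OverflowError; the shifted version is fine
probs = stable_softmax([1000.0, 1001.0])
print(abs(sum(probs) - 1.0) < 1e-12)  # True
```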


# Part 2

## sBERT - Semantic Similarity and clickbait classification

In [1]:
!pip install -U sentence-transformers

Requirement already up-to-date: sentence-transformers in /usr/local/lib/python3.7/dist-packages (1.1.0)
Collecting transformers<5.0.0,>=3.1.0
  Using cached https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Using cached https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl
Installing collected packages: tokenizers, transformers
  Found existing installation: tokenizers 0.7.0
    Uninstalling tokenizers-0.7.0:
      Successfully uninstalled tokenizers-0.7.0
  Found existing installation: transformers 2.11.0
    Uninstalling transformers-2.11.0:
      Successfully uninstalled transformers-2.11.0
Successfully installed tokenizers-0.10.2 transformers-4.5.1


In [2]:
!pip install -U transformers==3.4.0

Collecting transformers==3.4.0
  Downloading https://files.pythonhosted.org/packages/2c/4e/4f1ede0fd7a36278844a277f8d53c21f88f37f3754abf76a5d6224f76d4a/transformers-3.4.0-py3-none-any.whl (1.3MB)
Collecting tokenizers==0.9.2
  Downloading https://files.pythonhosted.org/packages/35/e7/edf655ae34925aeaefb7b7fcc3dd0887d2a1203ee6b0df4d1170d1a19d4f/tokenizers-0.9.2-cp37-cp37m-manylinux1_x86_64.whl (2.9MB)
Installing collected packages: tokenizers, transformers
  Found existing installation: tokenizers 0.9.3
    Uninstalling tokenizers-0.9.3:
      Successfully uninstalled tokenizers-0.9.3
  Found existing installation: transformers 3.5.1
    Uninstalling transformers-3.5.1:
      Successfully uninstalled transformers-3.5.1
Successfully installed tokenizers-0.9.2 transformers-3.4.0


I had some dependency issues getting sBERT to work here, but these versions seem to work. With transformers >= 3.5.1 it raises an ImportError: SAVE_STATE_WARNING, while versions < 3.4.0 raised a TypeError.

In [8]:
!pip freeze | grep transformers

sentence-transformers==1.1.0
transformers==3.4.0
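
The version bounds from the note above can be written down as a small check. This is only a sketch: the bounds are taken literally from that note, only 3.4.0 was confirmed to work in this notebook, and `parse_version` and `in_reported_working_range` are hypothetical helpers, not part of pip or transformers.

```python
def parse_version(text):
    """Turn '3.4.0' into (3, 4, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def in_reported_working_range(version):
    """True if 3.4.0 <= version < 3.5.1, the range implied by the note above."""
    return parse_version("3.4.0") <= parse_version(version) < parse_version("3.5.1")

for candidate in ["2.11.0", "3.4.0", "3.5.1", "4.5.1"]:
    print(candidate, in_reported_working_range(candidate))
```

Comparing tuples of integers rather than raw strings matters here: as strings, "3.10.0" would sort before "3.4.0".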


In [None]:

#Shows how sentences are embedded; the output is extremely long, so it has been removed.

from sentence_transformers import SentenceTransformer


model = SentenceTransformer('paraphrase-distilroberta-base-v1')

#The sentences we want to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

#Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

In [2]:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('paraphrase-distilroberta-base-v1')

#Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

cos_sim = util.pytorch_cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.5625]])
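
For readers unfamiliar with the measure, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite direction) to 1 (same direction). A plain-Python sketch on toy 2-dimensional vectors follows (the real embeddings are much longer); `cosine_similarity` here is a hand-written illustration, not the library function used above.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])     # unrelated directions
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # length does not matter
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(orthogonal, round(same_direction, 6), opposite)
```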


In [19]:
data[0:10]


Unnamed: 0,title,label
0,"15 Highly Important Questions About Adulthood,...",1
1,250 Nuns Just Cycled All The Way From Kathmand...,1
2,"Australian comedians ""could have been shot"" du...",0
3,Lycos launches screensaver to increase spammer...,0
4,Fußball-Bundesliga 2008–09: Goalkeeper Butt si...,0
5,"In Afghanistan, Soldiers Bridge 2 Stages of War",0
6,"After Fleeing North Korea, an Artist Parodies ...",0
7,Lessons (or Not) When a Start-Up Misses the Mark,0
8,Court Issues Order Against 3 Car-Warranty Call...,0
9,How Much Would Chris Traeger Like You Based On...,1


Next, I select titles from the clickbait dataset to assess whether sBERT, without fine-tuning, places clickbait titles closer to each other in embedding space than to non-clickbait titles. In the examples I picked, I compare the title at index 0 (clickbait) to the titles at indexes 1 (clickbait), 2 (not clickbait), 4 (not clickbait), and 9 (clickbait).

In [20]:
cb_emb1 = model.encode(data.title[0])

for i in (1,2,4,9):
  cb_emb2 = model.encode(data.title[i])
  cos_sim = util.pytorch_cos_sim(cb_emb1, cb_emb2)
  print("Cosine-Similarity 0 to {}:".format(i), cos_sim)

Cosine-Similarity 0 to 1: tensor([[0.0528]])
Cosine-Similarity 0 to 2: tensor([[0.1193]])
Cosine-Similarity 0 to 4: tensor([[0.0599]])
Cosine-Similarity 0 to 9: tensor([[0.1017]])
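
To read these four numbers against the hypothesis, one can average the similarities per label. The sketch below does that with the values printed above and the labels from the table. With only four comparisons this is anecdotal rather than a real evaluation, and the variable names are chosen for this note.

```python
from statistics import mean

# Cosine similarity of each title to title 0 (clickbait), copied from the output above
similarity_to_title_0 = {1: 0.0528, 2: 0.1193, 4: 0.0599, 9: 0.1017}
# Labels of the compared titles, taken from the dataset preview
is_clickbait = {1: True, 2: False, 4: False, 9: True}

same_label = mean(s for i, s in similarity_to_title_0.items() if is_clickbait[i])
other_label = mean(s for i, s in similarity_to_title_0.items() if not is_clickbait[i])

print("clickbait vs clickbait:     {:.3f}".format(same_label))
print("clickbait vs not clickbait: {:.3f}".format(other_label))
print("clickbait titles closer on average:", same_label > other_label)
```

In this small sample the clickbait title is, on average, no closer to the other clickbait titles than to the non-clickbait ones.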


In [27]:
sentences = list(data.title[0:100])

#Encode all sentences
embeddings = model.encode(sentences)

#Compute cosine similarity between all pairs
cos_sim = util.pytorch_cos_sim(embeddings, embeddings)

#Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

#Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))

Top-5 most similar pairs:
30 Decadent Fall Desserts For People Who Don't Like Pumpkin 	 17 Incredibly Helpful Charts For Cooking Thanksgiving Dinner 	 0.4924
22 Dogs Who Just Found Their Forever Homes 	 22 Dogs Who Totally Nailed Their Geeky Halloween Costumes 	 0.4565
26 Inspiring Dogs From NYC's Biggest Halloween Parade 	 22 Dogs Who Totally Nailed Their Geeky Halloween Costumes 	 0.4503
10 Life-Changing Things To Try In November 	 18 Ultra-Personalized Gifts To Keep For Yourself 	 0.4471
British government scraps planned rules on pay equality 	 UK MPs vote not to lower abortion limit 	 0.4462
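
The double loop above visits each unordered pair exactly once (the upper triangle of the similarity matrix), which gives 100 * 99 / 2 = 4950 pairs for 100 titles. An equivalent formulation with `itertools.combinations` and `heapq.nlargest` is sketched below on a toy 3x3 matrix with invented values; `top_k_pairs` is a hypothetical helper.

```python
import heapq
from itertools import combinations

def top_k_pairs(similarity, k):
    """Return the k highest-scoring (score, i, j) tuples with i < j."""
    index_pairs = combinations(range(len(similarity)), 2)
    scored = ((similarity[i][j], i, j) for i, j in index_pairs)
    return heapq.nlargest(k, scored)

toy_similarity = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]
top = top_k_pairs(toy_similarity, k=2)
print(top)  # [(0.9, 0, 2), (0.4, 1, 2)]

# Number of unordered pairs among 100 titles
n_pairs = len(list(combinations(range(100), 2)))
print(n_pairs)  # 4950
```

For a handful of top pairs, `heapq.nlargest` avoids sorting the full list, although with 4950 pairs either approach is fast.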


Since sBERT here is pretrained for semantic similarity but not fine-tuned for our task, it might be very difficult to have it perform the task we want. The results above are not unreasonable, but the most similar pairs mostly share a topic (dogs, Halloween, UK politics) rather than a clickbait style. How do we tell sBERT that it needs to detect clickbait as opposed to the semantics of the sentence? Would a similar problem arise when using unsupervised sBERT for threatening sentence detection?

There are many different pretrained versions of sBERT that have been trained for specific tasks, but there doesn't seem to be one specific enough for tasks such as clickbait detection or threat detection.

In [28]:
model = SentenceTransformer('stsb-mpnet-base-v2')

cb_emb1 = model.encode(data.title[0])

for i in (1,2,4,9):
  cb_emb2 = model.encode(data.title[i])
  cos_sim = util.pytorch_cos_sim(cb_emb1, cb_emb2)
  print("Cosine-Similarity 0 to {}:".format(i), cos_sim)

KeyError: ignored

In [29]:
model = SentenceTransformer('stsb-distilroberta-base-v2')

cb_emb1 = model.encode(data.title[0])

for i in (1,2,4,9):
  cb_emb2 = model.encode(data.title[i])
  cos_sim = util.pytorch_cos_sim(cb_emb1, cb_emb2)
  print("Cosine-Similarity 0 to {}:".format(i), cos_sim)


Cosine-Similarity 0 to 1: tensor([[-0.1138]])
Cosine-Similarity 0 to 2: tensor([[0.1200]])
Cosine-Similarity 0 to 4: tensor([[0.0161]])
Cosine-Similarity 0 to 9: tensor([[0.0973]])


In [30]:
sentences = list(data.title[0:100])

#Encode all sentences
embeddings = model.encode(sentences)

#Compute cosine similarity between all pairs
cos_sim = util.pytorch_cos_sim(embeddings, embeddings)

#Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim)-1):
    for j in range(i+1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

#Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))

Top-5 most similar pairs:
26 Inspiring Dogs From NYC's Biggest Halloween Parade 	 22 Dogs Who Totally Nailed Their Geeky Halloween Costumes 	 0.5390
22 Dogs Who Just Found Their Forever Homes 	 22 Dogs Who Totally Nailed Their Geeky Halloween Costumes 	 0.5322
30 Decadent Fall Desserts For People Who Don't Like Pumpkin 	 17 Incredibly Helpful Charts For Cooking Thanksgiving Dinner 	 0.5142
British government scraps planned rules on pay equality 	 UK rail firm cuts 180 jobs 	 0.4449
19 Realities All Women With Big Boobs Know To Be True 	 37 Things The Kardashians Have 100% Actually Said 	 0.4143
