In this notebook we will fine-tune a BERT model to classify texts.

The dataset we will use is the Offensive Language Identification Dataset (OLID), a collection of English tweets annotated for offensiveness. We focus on subtask A, a binary classification task in which each tweet is labeled either offensive (OFF) or not offensive (NOT).

In [None]:
!wget https://sites.google.com/site/offensevalsharedtask/olid/OLIDv1.0.zip
!unzip OLIDv1.0.zip

In [None]:
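import csv

# Training set: a single TSV file holding the tweet text and its subtask A label
data_train = []
labels_train = []
with open("olid-training-v1.0.tsv", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        data_train.append(row["tweet"])
        labels_train.append(row["subtask_a"])

# Test set: tweets and gold labels live in two separate files,
# which are assumed to list the same tweets in the same order
data_test = []
labels_test = []
with open("testset-levela.tsv", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        data_test.append(row["tweet"])

# The label file has no header row, so the column names are passed explicitly
with open("labels-levela.csv", encoding="utf-8") as f:
    reader = csv.DictReader(f, fieldnames=["id", "label"])
    for row in reader:
        labels_test.append(row["label"])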
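
Because the test tweets and their labels are read from two separate files, it is worth checking that the lists line up and looking at the class balance before training. The following is a minimal sketch: the helper name check_and_count and the toy data are hypothetical and only illustrate the check.

```python
from collections import Counter


def check_and_count(texts, labels):
    """Verify that texts and labels have equal length; return label counts."""
    if len(texts) != len(labels):
        raise ValueError(
            f"{len(texts)} texts but {len(labels)} labels: the files are misaligned"
        )
    return Counter(labels)


# Toy data standing in for the real lists (made-up examples, not OLID tweets)
toy_texts = ["example tweet one", "example tweet two", "example tweet three"]
toy_labels = ["NOT", "OFF", "NOT"]

counts = check_and_count(toy_texts, toy_labels)
print(counts)
```

In the notebook, the same helper could be called as check_and_count(data_train, labels_train) and check_and_count(data_test, labels_test).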


In [None]:
!pip install simpletransformers

In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import pandas as pd

# Shape the data the way simpletransformers expects: a "text" and a "labels" column
train_df = pd.DataFrame({"text": data_train, "labels": labels_train})

# Model configuration
model_args = ClassificationArgs()
model_args.num_train_epochs = 2
# Sort the label set so the label-to-index mapping is the same on every run
model_args.labels_list = sorted(set(labels_train))
model_args.train_batch_size = 16
model_args.overwrite_output_dir = True

# Create a ClassificationModel (model type, then pretrained checkpoint name)
model = ClassificationModel("bert", "bert-base-uncased", args=model_args)

# Fine-tune the model on the training data
model.train_model(train_df)

# Predict labels for the test tweets
predictions, raw_outputs = model.predict(data_test)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

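
Besides the predicted labels, model.predict also returns raw_outputs. Assuming these are unnormalized scores (logits), one row per text and one column per class, a softmax turns them into probabilities that can be inspected or thresholded. The sketch below uses made-up logits rather than real model output, and the assumption that the columns follow the order of labels_list should be verified before relying on it.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits standing in for raw_outputs (made-up values, not model output)
toy_raw_outputs = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]
toy_probs = [softmax(row) for row in toy_raw_outputs]

for row in toy_probs:
    print([round(p, 3) for p in row])
```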

In [None]:
from sklearn.metrics import classification_report

print(classification_report(labels_test, predictions))

              precision    recall  f1-score   support

         NOT       0.87      0.92      0.89       620
         OFF       0.75      0.65      0.70       240

    accuracy                           0.84       860
   macro avg       0.81      0.79      0.80       860
weighted avg       0.84      0.84      0.84       860
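
To put these scores in context, it helps to compare them with a trivial baseline that always predicts the majority class. The sketch below derives that baseline from the support counts in the report above (620 NOT, 240 OFF); the function name majority_baseline is hypothetical.

```python
def majority_baseline(support):
    """Accuracy and macro F1 of always predicting the most frequent label.

    support maps each label to its number of gold examples.
    """
    total = sum(support.values())
    majority = max(support, key=support.get)
    accuracy = support[majority] / total

    f1_scores = {}
    for label, count in support.items():
        if label == majority:
            precision = count / total  # every example is predicted as this label
            recall = 1.0               # so every gold example of it is found
            f1_scores[label] = 2 * precision * recall / (precision + recall)
        else:
            f1_scores[label] = 0.0     # this label is never predicted
    macro_f1 = sum(f1_scores.values()) / len(f1_scores)
    return majority, accuracy, macro_f1


# Support counts taken from the classification report
support = {"NOT": 620, "OFF": 240}
majority, accuracy, macro_f1 = majority_baseline(support)
print(f"{majority}: accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
# prints "NOT: accuracy=0.72, macro F1=0.42"
```

Against this baseline, the fine-tuned model's 0.84 accuracy and 0.80 macro F1 show that it has learned to identify the minority OFF class, although its recall there (0.65) remains well below its recall on NOT (0.92).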

