# Text Classification

    This notebook fine-tunes DistilBERT, a lighter version of the state-of-the-art text classification model BERT. The use case here is detecting toxicity in social media comments. However, given suitable labeled data, the same notebook could be used to fine-tune the model to classify text into any set of categories, the most common being sentiment analysis.
    
    

In [72]:
import pandas as pd
import transformers
from datasets import Dataset, load_dataset
import tensorflow as tf
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
import torch

## Importing Data

In [2]:
sample_submission = pd.read_csv('data/sample_submission.csv')
test_labels = pd.read_csv('data/test_labels.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

    Every comment labeled as toxic has a 1 in the `toxic` column and a 0 otherwise.
    The majority of comments are not toxic. So as not to drown the model in examples it does not need, we use undersampling: we keep every toxic comment and only an equally sized random sample of the non-toxic ones, which gives a dataset balanced between the two classes, toxic and not toxic.

In [3]:
sample_size = len(train.loc[train.toxic == 1])
train = pd.concat([train.loc[train.toxic == 1], train.loc[train.toxic == 0].sample(sample_size)]).sample(frac=1)
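
The undersampling step can be tried on a toy table before running it on the real data. The sketch below is illustrative only: the `toy` DataFrame, the `undersample` helper and the `seed` argument are assumptions that do not appear in the notebook. Passing a fixed `random_state` makes the sample reproducible, which the cell above does not do.

```python
import pandas as pd

# Toy stand-in for the training data: 3 toxic comments, 7 non-toxic ones
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(10)],
    "toxic": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

def undersample(df, label_col="toxic", seed=0):
    """Keep every positive row plus an equally sized random sample of negatives."""
    positives = df.loc[df[label_col] == 1]
    negatives = df.loc[df[label_col] == 0].sample(len(positives), random_state=seed)
    # sample(frac=1) shuffles the rows so the two classes are interleaved
    return pd.concat([positives, negatives]).sample(frac=1, random_state=seed)

balanced = undersample(toy)
print(balanced["toxic"].value_counts().to_dict())
```

Both classes end up with the same number of rows, at the cost of discarding most of the majority class.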

In [35]:
train.loc[:, ['id', 'comment_text', 'toxic']].rename(columns={'toxic': 'label'}).to_csv("data/new_train.csv", index=False)

In [36]:
dataset = load_dataset('csv', data_files={'train': 'data/new_train.csv'})


Using custom data configuration default-13dafa5a7babe9ed


Downloading and preparing dataset csv/default to /Users/aziz.mosbah/.cache/huggingface/datasets/csv/default-13dafa5a7babe9ed/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a...



Dataset csv downloaded and prepared to /Users/aziz.mosbah/.cache/huggingface/datasets/csv/default-13dafa5a7babe9ed/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a. Subsequent calls will reuse this data.



In [37]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'comment_text', 'label'],
        num_rows: 30588
    })
})

## Selecting the model & Pre-processing the data

### Defining the functions we will need

In [38]:
def preprocess_function(examples):
    return tokenizer(examples['comment_text'], truncation=True)

### Defining the variables & objects we will need

In [39]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Two output labels: 0 = not toxic, 1 = toxic
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)
# Pads each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
lr = 2e-5
num_epochs = 5
weight_decay = 0.01

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /Users/aziz.mosbah/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.11.3",
  "vocab_size": 30522
}

loading file https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt from cache at /Users/aziz.mosbah/.cache/huggingface/transformers/0e1bbfda7f63a99bb52e3915dcf10c3c92122b827d92eb2d3

### Tokenizing the dataset
    This means splitting each comment into tokens and mapping each token to an integer ID from the model's vocabulary, so that the model can ingest it.

In [40]:
pre_tokenizer_columns = set(dataset["train"].features)
encoded_dataset = dataset.map(preprocess_function, batched=True)
tokenizer_columns = list(set(encoded_dataset["train"].features) - pre_tokenizer_columns)
print("Columns added by tokenizer:", tokenizer_columns)


Columns added by tokenizer: ['attention_mask', 'input_ids']


    This is the first comment in the training set

In [41]:
encoded_dataset['train'][0]['comment_text']

"Yep, hopefully. ) I'm going to take a look at that, and a lot of the other interviews and sources, over the near future."

In [42]:
encoded_dataset

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'comment_text', 'id', 'input_ids', 'label'],
        num_rows: 30588
    })
})

    This is the tokenized version of that first comment

In [43]:
print(encoded_dataset['train'][0]['input_ids'])

[101, 15624, 1010, 11504, 1012, 1007, 1045, 1005, 1049, 2183, 2000, 2202, 1037, 2298, 2012, 2008, 1010, 1998, 1037, 2843, 1997, 1996, 2060, 7636, 1998, 4216, 1010, 2058, 1996, 2379, 2925, 1012, 102]
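
Tokenized comments have different lengths, so they must be padded before they can be stacked into a batch. The sketch below shows, in plain Python, the kind of dynamic padding `DataCollatorWithPadding` is expected to perform. It is a simplified illustration, not the library's implementation. The `pad_batch` helper is hypothetical, the two sequences are shortened from the ids printed above, and the pad id of 0 comes from `pad_token_id` in the model config.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Illustrative sequences: ids taken from the output above, shortened by hand
batch = pad_batch([
    [101, 15624, 1010, 11504, 1012, 102],
    [101, 2379, 2925, 102],
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short comments from being padded to the maximum length.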


## Setting up a trainer

In [44]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=weight_decay,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: id, comment_text.
***** Running training *****
  Num examples = 30588
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 9560


Step,Training Loss
500,0.2342
1000,0.1908
1500,0.1767
2000,0.1676
2500,0.114
3000,0.1267
3500,0.1218
4000,0.1004
4500,0.0735
5000,0.0783


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1500/special_toke

TrainOutput(global_step=9560, training_loss=0.0895507743667858, metrics={'train_runtime': 51393.6024, 'train_samples_per_second': 2.976, 'train_steps_per_second': 0.186, 'total_flos': 1.3184079681331056e+16, 'train_loss': 0.0895507743667858, 'epoch': 5.0})
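
The figures in the log and in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below recomputes the total number of optimization steps and the throughput from the values printed above, assuming a single device, no gradient accumulation, and a last partial batch that is kept rather than dropped.

```python
import math

num_examples = 30588         # "Num examples" in the training log
batch_size = 16              # per-device batch size, one device
num_epochs = 5
train_runtime = 51393.6024   # seconds, from TrainOutput

# The last batch of each epoch is smaller than 16, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(f"steps per epoch:    {steps_per_epoch}")
print(f"total steps:        {total_steps}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
print(f"runtime in hours:   {train_runtime / 3600:.1f}")
```

The results agree with the logged `Total optimization steps = 9560` and with the reported `train_samples_per_second` and `train_steps_per_second`.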

## Classification
    Given an input text, the model classifies it as either toxic or not toxic.

In [87]:
import warnings
warnings.filterwarnings('ignore')

In [88]:
input_text = 'Hello me dog is cute'
inputs = tokenizer(input_text, truncation=True, return_tensors="pt")
outputs = model(**inputs)
# Softmax over the class dimension turns the logits into probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"Probability of toxicity {float(probs[0, 1])}")


Probability of toxicity 0.003646745579317212


In [89]:
input_text = 'No your dog is the most disgusting creature I have ever seen'
inputs = tokenizer(input_text, truncation=True, return_tensors="pt")
outputs = model(**inputs)
# Softmax over the class dimension turns the logits into probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"Probability of toxicity {float(probs[0, 1])}")

Probability of toxicity 0.9998181462287903


In [90]:
input_text = 'I love you'
inputs = tokenizer(input_text, truncation=True, return_tensors="pt")
outputs = model(**inputs)
# Softmax over the class dimension turns the logits into probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"Probability of toxicity {float(probs[0, 1])}")

Probability of toxicity 0.0001617566595086828


In [91]:
input_text = 'I hate you'
inputs = tokenizer(input_text, truncation=True, return_tensors="pt")
outputs = model(**inputs)
# Softmax over the class dimension turns the logits into probabilities
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(f"Probability of toxicity {float(probs[0, 1])}")

Probability of toxicity 0.999875545501709
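
The probabilities printed above come from applying a softmax to the two logits the model returns for each comment. As a minimal sketch of that last step, the plain-Python version below uses made-up logits, not values produced by the trained model. Index 1 is read as "toxic" because the `label` column was derived from `toxic`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one comment: index 0 = not toxic, index 1 = toxic
logits = [3.0, -2.5]
probs = softmax(logits)
print(f"Probability of toxicity {probs[1]:.4f}")
```

With only two classes, the softmax reduces to a sigmoid of the difference between the two logits, so a large gap between them gives a probability very close to 0 or 1.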
