# Attempt 2: Hugging Face Transformers

After some research, it is clear that we can also use Hugging Face Transformers to train a model, output the predicted probabilities, and then compute AUC-ROC on them.
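As a rough illustration of that last step, the sketch below turns two-class logits into a positive-class probability and scores them with AUC-ROC. It is dependency-free on purpose; in the notebook itself `sklearn.metrics.roc_auc_score` would be the usual choice. The helper names and the toy logits are made up for the example, and the pairwise AUC is quadratic in the number of rows, so it is only meant for small inputs.

```python
import math


def softmax_positive(logits):
    """Return P(class 1) from a pair of logits [logit_0, logit_1]."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return exps[1] / sum(exps)


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical logits for four comments (not real model output)
logits = [[2.0, -1.0], [0.5, 0.3], [-0.2, 1.1], [-1.5, 2.5]]
labels = [0, 1, 0, 1]

probs = [softmax_positive(z) for z in logits]
auc = roc_auc(labels, probs)
print(auc)
```

Only the ranking of the scores matters for AUC-ROC, so the raw positive-class logit margin would give the same value as the softmax probability.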

# Lessons learned

I found that the model trained slowly with any batch size greater than 8, even though I had used batch sizes of 32 and 64 on this same machine with a similar dataset size.

What I initially failed to realize was that the tokenized data was padded to a larger max_length than the previous dataset. The other set was padded to a max_length of 128, while here I was using 512, the maximum that bert-base-uncased supports. Since the padded sequences were 4x longer, the batch size had to be roughly 4x smaller to fit into GPU memory.
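The arithmetic behind this can be sketched as follows. It assumes memory scales linearly with the number of padded token positions per batch, which is only a first-order approximation: self-attention memory grows quadratically with sequence length, so the real limit may be lower. The helper names are hypothetical.

```python
def padded_tokens_per_batch(batch_size: int, max_length: int) -> int:
    """Token positions in one batch when every sequence is padded to max_length."""
    return batch_size * max_length


def equivalent_batch_size(old_batch_size: int, old_max_length: int, new_max_length: int) -> int:
    """Batch size that keeps the padded token count per batch unchanged (linear approximation)."""
    budget = padded_tokens_per_batch(old_batch_size, old_max_length)
    return max(1, budget // new_max_length)


# Batch size 32 at max_length 128 fills as many token positions as batch size 8 at 512
print(padded_tokens_per_batch(32, 128))     # 4096
print(padded_tokens_per_batch(8, 512))      # 4096
print(equivalent_batch_size(32, 128, 512))  # 8
```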

# Issues
The subtrain/subval splits are not the same across labels, because I stratify by each individual label separately. This is not the same as stratifying by the combination of labels, so each label's model ends up with different rows in its training and validation sets.
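One possible way around this, sketched below in plain Python, is to encode the six labels as a single combination key, draw one shared sample stratified on that key, and then derive every per-label binary view from the same rows. This is not what the notebook does: `combo_key`, `stratified_sample`, and the toy rows are hypothetical. With scikit-learn the same key could instead be passed to `train_test_split(..., stratify=...)`, provided every combination has at least two rows.

```python
import random
from collections import Counter, defaultdict

LABEL_COLS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


def combo_key(row):
    """Encode the six binary labels as one string, e.g. '101000'."""
    return ''.join(str(int(row[c])) for c in LABEL_COLS)


def stratified_sample(rows, fraction, seed=42):
    """Sample `fraction` of the rows from every label combination (at least one row each)."""
    groups = defaultdict(list)
    for row in rows:
        groups[combo_key(row)].append(row)
    rng = random.Random(seed)
    sample = []
    for key in sorted(groups):  # sorted for a reproducible order
        members = groups[key]
        rng.shuffle(members)
        k = max(1, round(len(members) * fraction))
        sample.extend(members[:k])
    return sample


# Toy rows (made up): four label combinations with ten rows each
rows = [
    {'id': i, 'toxic': i % 2, 'severe_toxic': 0, 'obscene': (i // 2) % 2,
     'threat': 0, 'insult': 0, 'identity_hate': 0}
    for i in range(40)
]

shared = stratified_sample(rows, fraction=0.5)

# Every per-label binary view is built from the same shared rows
binary_views = {label: [(r['id'], r[label]) for r in shared] for label in LABEL_COLS}
print(Counter(combo_key(r) for r in shared))
```

Taking at least one row per combination slightly oversamples very rare combinations, which may or may not be desirable.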

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split


In [2]:

RUN=1
#load csv files
train = pd.read_csv('../data/train_split.csv')
valid = pd.read_csv('../data/valid_split.csv')

## data preprocessing

In [3]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [4]:
binary_dfs_train = {}
for label in label_cols:
    # draw a 3% subsample of the training data, stratified on this label only
    subtrain, _ = train_test_split(train, test_size=0.97, random_state=42, stratify=train[label])
    # keep only the columns needed for training and rename the target column to 'label'
    binary_dfs_train[label] = subtrain[['id', 'comment_text', label]].rename(columns={label: 'label'})

In [24]:
binary_dfs_train['severe_toxic']['text_word_count'] = binary_dfs_train['severe_toxic']['comment_text'].apply(lambda x: len(x.split()))
binary_dfs_train['severe_toxic'][binary_dfs_train['severe_toxic']['text_word_count']>500].shape

(42, 4)

In [5]:
binary_dfs_val = {}
for label in label_cols:
    # draw a 10% subsample of the validation data, stratified on this label only
    subvalid, _ = train_test_split(valid, test_size=0.9, random_state=42, stratify=valid[label])
    # keep only the columns needed for evaluation and rename the target column to 'label'
    binary_dfs_val[label] = subvalid[['id', 'comment_text', label]].rename(columns={label: 'label'})

## model training

In [6]:
#train a huggingface classifier
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from transformers import Trainer, TrainingArguments

#instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")



In [7]:
#check if cuda is available
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [8]:
!wandb disabled

W&B disabled.


In [22]:
import datasets


binary_datasets_train = {}
for key in binary_dfs_train.keys():
    # convert the DataFrame into a Dataset object
    binary_datasets_train[key] = datasets.Dataset.from_pandas(binary_dfs_train[key])
    # tokenize in batches, truncating/padding every comment to 128 tokens
    binary_datasets_train[key] = binary_datasets_train[key].map(
        lambda batch: tokenizer(batch['comment_text'], truncation=True, padding='max_length', max_length=128),
        batched=True,
        batch_size=64,
    )
    # expose only the model inputs and the label as torch tensors
    binary_datasets_train[key].set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    

Token indices sequence length is longer than the specified maximum sequence length for this model (1380 > 512). Running this sequence through the model will result in indexing errors


Map: 100%|██████████| 3829/3829 [00:00<00:00, 9723.16 examples/s]
Map: 100%|██████████| 3829/3829 [00:00<00:00, 9754.20 examples/s]
Map: 100%|██████████| 3829/3829 [00:00<00:00, 9716.55 examples/s]
Map: 100%|██████████| 3829/3829 [00:00<00:00, 9545.50 examples/s]
Map: 100%|██████████| 3829/3829 [00:00<00:00, 10024.99 examples/s]
Map: 100%|██████████| 3829/3829 [00:00<00:00, 9205.07 examples/s]


In [13]:
binary_datasets_val = {}
for key in binary_dfs_val.keys():
    # convert the DataFrame into a Dataset object
    binary_datasets_val[key] = datasets.Dataset.from_pandas(binary_dfs_val[key])
    # tokenize in batches, truncating/padding every comment to 128 tokens
    binary_datasets_val[key] = binary_datasets_val[key].map(
        lambda batch: tokenizer(batch['comment_text'], truncation=True, padding='max_length', max_length=128),
        batched=True,
        batch_size=64,
    )
    # expose only the model inputs and the label as torch tensors
    binary_datasets_val[key].set_format('torch', columns=['input_ids', 'attention_mask', 'label'])




Map: 100%|██████████| 3191/3191 [00:00<00:00, 7194.52 examples/s]
Map: 100%|██████████| 3191/3191 [00:00<00:00, 7588.55 examples/s]
Map: 100%|██████████| 3191/3191 [00:00<00:00, 7322.71 examples/s]
Map: 100%|██████████| 3191/3191 [00:00<00:00, 7109.12 examples/s]
Map: 100%|██████████| 3191/3191 [00:00<00:00, 7199.08 examples/s]
Map: 100%|██████████| 3191/3191 [00:00<00:00, 7710.70 examples/s]


In [11]:
#import trainer and training arguments
from transformers import Trainer, TrainingArguments
for key in binary_datasets_train.keys():
    #instantiate a fresh binary classifier (num_labels=2) for this label
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
    model.to(device)

    #instantiate the training arguments
    training_args = TrainingArguments(
        output_dir=f'../results/{key}/{RUN}',          # output directory
        num_train_epochs=3,              # total number of training epochs
        per_device_train_batch_size=64,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        learning_rate=5e-05,             # learning rate
        evaluation_strategy='epoch',
        save_strategy='epoch',
        fp16=True,
    )

    #instantiate the trainer
    trainer = Trainer(
        model=model,                         # the instantiated 🤗 Transformers model to be trained
        args=training_args,                  # training arguments, defined above
        train_dataset=binary_datasets_train[key],         # training dataset
        eval_dataset=binary_datasets_val[key],             # evaluation dataset
    )   

    #train the model
    trainer.train()

    #evaluate the model
    trainer.evaluate()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
 22%|██▏       | 80/360 [00:13<00:45,  6.15it/s]

KeyboardInterrupt: 