# Attempt 2: Hugging Face Transformers

After some research, it is clear that we can also use Hugging Face Transformers to fine-tune a model, output the class probabilities, and then score them with ROC AUC.
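As a rough illustration of the scoring step: ROC AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting half. The sketch below computes that directly in plain Python on made-up labels and scores. The function name `roc_auc` is hypothetical, and in practice `sklearn.metrics.roc_auc_score` would normally be used instead.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC for binary labels (1 = positive, 0 = negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(pos) * len(neg))


# made-up labels and predicted probabilities of the positive class
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic, so it is only meant to show what the metric measures, not to score the full test set.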

# Lessons learned

I found that training was slow with any batch size greater than 8, even though I had used batch sizes of 32 and 64 on this same machine with a similar dataset size.

What I initially failed to realize was that the tokenized data used a larger max_length than the previous dataset. The other set was padded to a max_length of 128, while here I was using 512, since the model was bert-base-uncased. Because the padded sequences were 4x longer, the batch size had to be about 4x smaller to fit into GPU memory.
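The arithmetic behind this can be sketched as a padded-token budget per batch. This is only a first-order estimate: self-attention memory grows faster than linearly with sequence length, so the real limit may be lower. The helper names are hypothetical.

```python
def padded_tokens_per_batch(batch_size, max_length):
    """Number of token positions in one padded batch."""
    return batch_size * max_length


def scaled_batch_size(old_batch_size, old_max_length, new_max_length):
    """Batch size that keeps roughly the same padded-token budget."""
    return max(1, old_batch_size * old_max_length // new_max_length)


previous = padded_tokens_per_batch(batch_size=32, max_length=128)
current = padded_tokens_per_batch(batch_size=8, max_length=512)
print(previous, current)                # same budget for both settings
print(scaled_batch_size(32, 128, 512))  # batch size suggested for max_length=512
```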

# Issues
The subtrain/subval splits are not the same across labels, because I stratify by each individual label separately. This is not the same as stratifying by the combination of labels, so each label's model is trained and validated on different rows.
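One possible way around this is to build a single key from all six labels and sample once per label combination, so every label's model sees the same rows. The plain-Python sketch below shows the idea on rows stored as dicts; `combo_key` and `stratified_subset` are hypothetical helpers, not part of this notebook.

```python
import random
from collections import defaultdict

LABEL_COLS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


def combo_key(row, label_cols=LABEL_COLS):
    """Join the 0/1 labels of one row into a string such as '100100'."""
    return ''.join(str(int(row[c])) for c in label_cols)


def stratified_subset(rows, frac, seed=42):
    """Sample about `frac` of the rows from every label combination."""
    groups = defaultdict(list)
    for row in rows:
        groups[combo_key(row)].append(row)
    rng = random.Random(seed)
    subset = []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        # keep at least one row so rare combinations are not dropped
        n_keep = max(1, round(len(members) * frac))
        subset.extend(members[:n_keep])
    return subset
```

With pandas, the same key could be built with `train[label_cols].astype(str).agg(''.join, axis=1)` and passed as `stratify=` to a single `train_test_split` call. scikit-learn raises an error when a combination occurs only once, so very rare combinations may need to be merged first.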

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split


In [2]:

RUN=1


## evaluation of data preprocessing

One of the main things I want to find out is how short the tokenizer can truncate the data without affecting most of the samples. Since 95% of the comments have about 230 words or fewer, I will use 256 for max_length. Note that this is a whitespace word count; the number of subword tokens per comment will generally be higher.

In [3]:
#load csv files
train = pd.read_csv('../data/train_split.csv')
valid = pd.read_csv('../data/valid_split.csv')


In [4]:
train.columns

Index(['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate'],
      dtype='object')

In [5]:
train['word_count'] = train['comment_text'].apply(lambda x: len(str(x).split()))
#get the 95% percentile
train['word_count'].quantile(0.95)

230.0

In [6]:
valid['word_count'] = valid['comment_text'].apply(lambda x: len(str(x).split()))
#get the 95% percentile
valid['word_count'].quantile(0.95)

229.0
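A complementary check to the percentile is the share of samples that a given max_length would actually truncate. The sketch below uses made-up lengths and a hypothetical helper `share_truncated`. With the real data, the lengths would ideally be token counts (for example `len(tokenizer(text)['input_ids'])`) rather than word counts.

```python
def share_truncated(lengths, max_length):
    """Fraction of samples whose length exceeds max_length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n > max_length) / len(lengths)


# made-up sequence lengths, for illustration only
lengths = [12, 40, 95, 230, 300, 1008]
share = share_truncated(lengths, max_length=256)
print(f"{share:.1%} of samples would be truncated")
```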

## data preprocessing

In [7]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [8]:
binary_dfs_train = {}
for label in label_cols:
    #keep a 3% subset of train, stratified on the current label
    subtrain, _ = train_test_split(train, test_size=0.97, random_state=42, stratify=train[label])
    #keep only the columns the model needs and rename the label column to 'label'
    binary_dfs_train[label] = subtrain[['id', 'comment_text', label]].rename(columns={label: 'label'})

In [9]:
binary_dfs_val = {}
for label in label_cols:
    #keep a 10% subset of valid, stratified on the current label
    subvalid, _ = train_test_split(valid, test_size=0.9, random_state=42, stratify=valid[label])
    #keep only the columns the model needs and rename the label column to 'label'
    binary_dfs_val[label] = subvalid[['id', 'comment_text', label]].rename(columns={label: 'label'})

## model training

After a few runs, it became clear that with the current data subset (padded to a max_length of 256) the maximum workable batch size is 64.

In [63]:
#train a huggingface classifier
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from transformers import Trainer, TrainingArguments

#instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

#truncation, padding and max_length are not settings stored on the tokenizer object;
#they are passed on each tokenizer call below
#(truncation=True, padding='max_length', max_length=256)

print(tokenizer)

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)


In [11]:
#check if cuda is available
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [12]:
!wandb disabled

W&B disabled.


In [13]:
import datasets


binary_datasets_train = {}
for key in binary_dfs_train.keys():
    
    #convert the data into a Dataset object
    binary_datasets_train[key] = datasets.Dataset.from_pandas(binary_dfs_train[key])
    binary_datasets_train[key] = binary_datasets_train[key].map(lambda batch: tokenizer(batch['comment_text'], truncation=True, padding='max_length',max_length=256), batched=True,batch_size=64)
    
    binary_datasets_train[key].set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    

Map:   0%|          | 0/3829 [00:00<?, ? examples/s]

Map: 100%|██████████| 3829/3829 [00:00<00:00, 3915.66 examples/s]
Map: 100%|██████████| 3829/3829 [00:00<00:00, 4131.57 examples/s]
Map: 100%|██████████| 3829/3829 [00:00<00:00, 4096.48 examples/s]
Map: 100%|██████████| 3829/3829 [00:00<00:00, 4229.31 examples/s]
Map: 100%|██████████| 3829/3829 [00:00<00:00, 4054.98 examples/s]
Map: 100%|██████████| 3829/3829 [00:00<00:00, 4420.35 examples/s]


In [14]:
binary_datasets_val = {}
for key in binary_dfs_val.keys():


    #convert the data into a Dataset object
    binary_datasets_val[key] = datasets.Dataset.from_pandas(binary_dfs_val[key])
    binary_datasets_val[key] = binary_datasets_val[key].map(lambda batch: tokenizer(batch['comment_text'], truncation=True, padding='max_length',max_length=256), batched=True,batch_size=64)
    
    binary_datasets_val[key].set_format('torch', columns=['input_ids', 'attention_mask', 'label'])


Map:   0%|          | 0/3191 [00:00<?, ? examples/s]

Map: 100%|██████████| 3191/3191 [00:00<00:00, 4509.21 examples/s]
Map: 100%|██████████| 3191/3191 [00:00<00:00, 4463.43 examples/s]
Map: 100%|██████████| 3191/3191 [00:00<00:00, 4508.69 examples/s]
Map: 100%|██████████| 3191/3191 [00:00<00:00, 4586.06 examples/s]
Map: 100%|██████████| 3191/3191 [00:00<00:00, 4170.16 examples/s]
Map: 100%|██████████| 3191/3191 [00:00<00:00, 4510.05 examples/s]


In [15]:
#import trainer and training arguments
from transformers import Trainer, TrainingArguments
for key in binary_datasets_train.keys():
    #instantiate a fresh binary classifier (num_labels=2) for the current label
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
    model.to(device)

    #instantiate the training arguments
    training_args = TrainingArguments(
        output_dir=f'../results/{key}/{RUN}',          # output directory
        num_train_epochs=3,              # total number of training epochs
        per_device_train_batch_size=64,  # batch size per device during training
        per_device_eval_batch_size=64,   # batch size for evaluation
        learning_rate=5e-05,             # learning rate
        evaluation_strategy='epoch',
        save_strategy='epoch',
        fp16=True,
    )

    #instantiate the trainer
    trainer = Trainer(
        model=model,                         # the instantiated 🤗 Transformers model to be trained
        args=training_args,                  # training arguments, defined above
        train_dataset=binary_datasets_train[key],         # training dataset
        eval_dataset=binary_datasets_val[key],             # evaluation dataset
    )   

    #train the model
    trainer.train()

    #evaluate the model
    trainer.evaluate()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
                                                
 33%|███▎      | 60/180 [00:47<01:09,  1.73it/s]

{'eval_loss': 0.14235129952430725, 'eval_runtime': 10.1537, 'eval_samples_per_second': 314.27, 'eval_steps_per_second': 4.924, 'epoch': 1.0}


                                                 
 67%|██████▋   | 120/180 [01:35<00:34,  1.74it/s]

{'eval_loss': 0.1216760203242302, 'eval_runtime': 10.1661, 'eval_samples_per_second': 313.887, 'eval_steps_per_second': 4.918, 'epoch': 2.0}


                                                 
100%|██████████| 180/180 [02:24<00:00,  1.75it/s]

{'eval_loss': 0.13707205653190613, 'eval_runtime': 10.1756, 'eval_samples_per_second': 313.593, 'eval_steps_per_second': 4.914, 'epoch': 3.0}


100%|██████████| 180/180 [02:26<00:00,  1.23it/s]


{'train_runtime': 149.3333, 'train_samples_per_second': 76.922, 'train_steps_per_second': 1.205, 'train_loss': 0.1055162959628635, 'epoch': 3.0}


100%|██████████| 50/50 [00:10<00:00,  4.82it/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
                                                
 33%|███▎      | 60/180 [00:46<01:09,  1.72it/s]

{'eval_loss': 0.027402380481362343, 'eval_runtime': 10.238, 'eval_samples_per_second': 311.682, 'eval_steps_per_second': 4.884, 'epoch': 1.0}


                                                 
 67%|██████▋   | 120/180 [01:36<00:34,  1.72it/s]

{'eval_loss': 0.029304224997758865, 'eval_runtime': 10.2574, 'eval_samples_per_second': 311.094, 'eval_steps_per_second': 4.875, 'epoch': 2.0}


                                                 
100%|██████████| 180/180 [02:25<00:00,  1.72it/s]

{'eval_loss': 0.02797190472483635, 'eval_runtime': 10.3017, 'eval_samples_per_second': 309.754, 'eval_steps_per_second': 4.854, 'epoch': 3.0}


100%|██████████| 180/180 [02:28<00:00,  1.21it/s]


{'train_runtime': 148.5944, 'train_samples_per_second': 77.304, 'train_steps_per_second': 1.211, 'train_loss': 0.04140154520670573, 'epoch': 3.0}


100%|██████████| 50/50 [00:10<00:00,  4.82it/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
 33%|███▎      | 60/180 [00:36<01:09,  1.73it/s]
 33%|███▎      | 60/180 [00:47<01:09,  1.73it/s]

{'eval_loss': 0.06302615255117416, 'eval_runtime': 10.2912, 'eval_samples_per_second': 310.07, 'eval_steps_per_second': 4.859, 'epoch': 1.0}


 67%|██████▋   | 120/180 [01:26<00:34,  1.72it/s]
 67%|██████▋   | 120/180 [01:36<00:34,  1.72it/s]

{'eval_loss': 0.06792459636926651, 'eval_runtime': 10.276, 'eval_samples_per_second': 310.529, 'eval_steps_per_second': 4.866, 'epoch': 2.0}


                                                 
100%|██████████| 180/180 [02:26<00:00,  1.74it/s]

{'eval_loss': 0.07258742302656174, 'eval_runtime': 10.2083, 'eval_samples_per_second': 312.59, 'eval_steps_per_second': 4.898, 'epoch': 3.0}


100%|██████████| 180/180 [02:29<00:00,  1.21it/s]


{'train_runtime': 149.0198, 'train_samples_per_second': 77.084, 'train_steps_per_second': 1.208, 'train_loss': 0.060623889499240455, 'epoch': 3.0}


100%|██████████| 50/50 [00:10<00:00,  4.82it/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
                                                
 33%|███▎      | 60/180 [00:46<01:09,  1.73it/s]

{'eval_loss': 0.018977897241711617, 'eval_runtime': 10.2293, 'eval_samples_per_second': 311.946, 'eval_steps_per_second': 4.888, 'epoch': 1.0}


                                                 
 67%|██████▋   | 120/180 [01:34<00:34,  1.75it/s]

{'eval_loss': 0.01690705306828022, 'eval_runtime': 10.0918, 'eval_samples_per_second': 316.198, 'eval_steps_per_second': 4.955, 'epoch': 2.0}


                                                 
100%|██████████| 180/180 [02:22<00:00,  1.75it/s]

{'eval_loss': 0.01664489507675171, 'eval_runtime': 10.1524, 'eval_samples_per_second': 314.311, 'eval_steps_per_second': 4.925, 'epoch': 3.0}


100%|██████████| 180/180 [02:24<00:00,  1.25it/s]


{'train_runtime': 144.2657, 'train_samples_per_second': 79.624, 'train_steps_per_second': 1.248, 'train_loss': 0.03548899491628011, 'epoch': 3.0}


100%|██████████| 50/50 [00:10<00:00,  4.98it/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
 33%|███▎      | 60/180 [00:35<01:08,  1.75it/s]
 33%|███▎      | 60/180 [00:46<01:08,  1.75it/s]

{'eval_loss': 0.09803038090467453, 'eval_runtime': 10.1472, 'eval_samples_per_second': 314.472, 'eval_steps_per_second': 4.927, 'epoch': 1.0}


 67%|██████▋   | 120/180 [01:23<00:34,  1.74it/s]
 67%|██████▋   | 120/180 [01:34<00:34,  1.74it/s]

{'eval_loss': 0.10236942023038864, 'eval_runtime': 10.1364, 'eval_samples_per_second': 314.805, 'eval_steps_per_second': 4.933, 'epoch': 2.0}


100%|██████████| 180/180 [02:11<00:00,  1.75it/s]
100%|██████████| 180/180 [02:21<00:00,  1.75it/s]

{'eval_loss': 0.09992678463459015, 'eval_runtime': 10.192, 'eval_samples_per_second': 313.088, 'eval_steps_per_second': 4.906, 'epoch': 3.0}


100%|██████████| 180/180 [02:23<00:00,  1.25it/s]


{'train_runtime': 143.7367, 'train_samples_per_second': 79.917, 'train_steps_per_second': 1.252, 'train_loss': 0.08228690889146593, 'epoch': 3.0}


100%|██████████| 50/50 [00:10<00:00,  4.97it/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
 33%|███▎      | 60/180 [00:36<01:08,  1.75it/s]
 33%|███▎      | 60/180 [00:46<01:08,  1.75it/s]

{'eval_loss': 0.04075108468532562, 'eval_runtime': 10.1109, 'eval_samples_per_second': 315.599, 'eval_steps_per_second': 4.945, 'epoch': 1.0}


 67%|██████▋   | 120/180 [01:24<00:34,  1.74it/s]
 67%|██████▋   | 120/180 [01:34<00:34,  1.74it/s]

{'eval_loss': 0.02863573096692562, 'eval_runtime': 10.1368, 'eval_samples_per_second': 314.792, 'eval_steps_per_second': 4.933, 'epoch': 2.0}


100%|██████████| 180/180 [02:12<00:00,  1.74it/s]
100%|██████████| 180/180 [02:22<00:00,  1.74it/s]

{'eval_loss': 0.027646707370877266, 'eval_runtime': 10.1401, 'eval_samples_per_second': 314.692, 'eval_steps_per_second': 4.931, 'epoch': 3.0}


100%|██████████| 180/180 [02:24<00:00,  1.25it/s]


{'train_runtime': 144.243, 'train_samples_per_second': 79.636, 'train_steps_per_second': 1.248, 'train_loss': 0.04721676508585612, 'epoch': 3.0}


100%|██████████| 50/50 [00:10<00:00,  4.97it/s]


# Obtaining predictions on the test set

In [25]:
#load in the test data
test = pd.read_csv('../data/test.csv.zip')

In [32]:
test.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [70]:
#load the local model
model = AutoModelForSequenceClassification.from_pretrained(f'../results/toxic/{RUN}/checkpoint-180')
model.to(device)
#load the pipeline
from transformers import TextClassificationPipeline
from tqdm import tqdm
#tqdm.pandas()
#convert the test data into a Dataset object
test_dataset = datasets.Dataset.from_pandas(test)
test_dataset = test_dataset.map(lambda batch: tokenizer(batch['comment_text'], truncation=True, padding='max_length',max_length=256), batched=True,batch_size=64)
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask'])
#instantiate the pipeline
pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer,device=device, top_k=None)
#test[label] = test['comment_text'].progress_apply(lambda x: pipeline(x)[0][0]['score'])

Map: 100%|██████████| 153164/153164 [01:09<00:00, 2194.28 examples/s]


In [72]:
#using the pipeline, predict on the test data

for label in tqdm(label_cols):
    #track progress using tqdm
    test[label] = test_dataset.map(lambda example: pipeline(example['comment_text'])[0][0]['score'], batched=True, batch_size=64)

Map:   0%|          | 0/153164 [00:00<?, ? examples/s]
  0%|          | 0/6 [00:39<?, ?it/s]


RuntimeError: The size of tensor a (1008) must match the size of tensor b (512) at non-singleton dimension 1
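The error above is consistent with the pipeline re-tokenizing the raw `comment_text` without truncation: a comment of 1008 tokens cannot be matched against DistilBERT's 512 position embeddings. Passing `truncation=True, max_length=256` in the pipeline call should avoid this, on the assumption that the text-classification pipeline forwards extra keyword arguments to the tokenizer. A second pitfall is that `[0][0]['score']` takes whichever class is listed first, which is not necessarily the positive class. The sketch below picks the score by label name instead. `positive_scores` is a hypothetical helper, `fake_output` is made up, and `LABEL_1` is assumed to be the default name of the positive class.

```python
def positive_scores(pipeline_output, positive_label="LABEL_1"):
    """Extract the positive-class score for each text from all-class output."""
    scores = []
    for per_text in pipeline_output:
        by_label = {entry["label"]: entry["score"] for entry in per_text}
        scores.append(by_label[positive_label])
    return scores


# made-up output in the shape returned for a list of two texts
fake_output = [
    [{"label": "LABEL_1", "score": 0.98}, {"label": "LABEL_0", "score": 0.02}],
    [{"label": "LABEL_0", "score": 0.97}, {"label": "LABEL_1", "score": 0.03}],
]
probs = positive_scores(fake_output)
print(probs)

# with a real pipeline (assumed call, not run here):
# probs = positive_scores(pipeline(list_of_texts, truncation=True, max_length=256))
```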

In [None]:
"""
#print(test['comment_text'][0])
#print(test['comment_text'].iloc[0:2])
#make 1 prediction on the test data
print(pipeline([test['comment_text'][0],test['comment_text'][0]]))
print(pipeline(test['comment_text'][0]))

print(pipeline([test['comment_text'][0],test['comment_text'][0]])[0])
print(pipeline(test['comment_text'][0])[0][0])

#make predictions and track using tqdm
tqdm.pandas()
test['preds'] = test['comment_text'].progress_apply(lambda x: pipeline(x))
"""

In [None]:
#load in the test data
test = pd.read_csv('../data/test.csv.zip')

#instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
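from transformers import TextClassificationPipeline

for label in label_cols:
    #load the local model
    model = AutoModelForSequenceClassification.from_pretrained(f'../results/{label}/{RUN}/checkpoint-180')
    model.to(device)
    #load the pipeline; top_k=None returns the scores of both classes
    pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer, device=device, top_k=None)
    #make predictions: truncate long comments so they fit the model and
    #keep the score of the positive class (LABEL_1)
    test[label] = test['comment_text'].apply(
        lambda x: next(s['score'] for s in pipeline(x, truncation=True, max_length=256)[0] if s['label'] == 'LABEL_1')
    )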
    

In [23]:
#For each model, load it from local and evaluate it on the test set
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

test = pd.read_csv('../data/test.csv.zip')
#tokenize the test set
test_dataset = datasets.Dataset.from_pandas(test)
test_dataset = test_dataset.map(lambda batch: tokenizer(batch['comment_text'], truncation=True, padding='max_length',max_length=256), batched=True,batch_size=64)
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask'])

for key in binary_datasets_train.keys():
    #load the local model
    model = AutoModelForSequenceClassification.from_pretrained(f'../results/{key}/{RUN}/checkpoint-180')
    model.to(device)
    #instantiate the trainer (training_args is reused from the training loop; only prediction is run here)
    trainer = Trainer(
        model=model,                         # the fine-tuned 🤗 Transformers model
        args=training_args,                  # training arguments, defined above
    )
    #run predictions on the test set; predictions.predictions holds the raw logits
    predictions = trainer.predict(test_dataset)
    #get the predicted class (argmax over the two logits) and save it
    test[f'{key}_pred'] = np.argmax(predictions.predictions, axis=1)

    #stop after the first model while testing
    break
predictions

Map: 100%|██████████| 153164/153164 [00:51<00:00, 2985.51 examples/s]
100%|██████████| 2394/2394 [08:05<00:00,  4.93it/s]


PredictionOutput(predictions=array([[-2.2597656 ,  2.2558594 ],
       [ 3.34375   , -2.8144531 ],
       [ 3.2167969 , -2.7324219 ],
       ...,
       [ 3.3945312 , -2.8359375 ],
       [ 3.2402344 , -2.7460938 ],
       [-0.69873047,  0.9082031 ]], dtype=float32), label_ids=None, metrics={'test_runtime': 485.7119, 'test_samples_per_second': 315.339, 'test_steps_per_second': 4.929})
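The `predictions` array above holds raw logits, one row per comment and one column per class. To get the probabilities needed for ROC AUC, a softmax over each row is one option. The sketch below applies it in plain Python to three of the rows shown above, assuming column 1 is the positive class. In practice `torch.softmax` or `scipy.special.softmax` on the whole array would be the usual choice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# three rows copied from the PredictionOutput above
logits = [
    [-2.2597656, 2.2558594],
    [3.34375, -2.8144531],
    [-0.69873047, 0.9082031],
]
# probability of the positive class (column 1) for each row
positive_probs = [softmax(row)[1] for row in logits]
print(positive_probs)
```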

In [None]:
for key in binary_datasets_train.keys():
    #load the local model
    model = AutoModelForSequenceClassification.from_pretrained(f'../results/{key}/{RUN}/checkpoint-180')
    model.to(device)
    #instantiate the trainer
    trainer = Trainer(
        model=model,                         # the instantiated 🤗 Transformers model to be trained
        args=training_args,                  # training arguments, defined above
    )
    #run predictions on the test set
    predictions = trainer.predict(test_dataset)
    break
predictions