# Dataset Tokenization for Fine-tuning DistilBERT for Binary Text Classification
The Discord bot will send every incoming server message to the classifier that this notebook prepares data for. The classifier checks whether the basic rules of the Discord server are being followed.

This model is being trained to enforce the level 1 rules:

1. No Offensive Language

2. No Spam

3. No Threats

In this notebook I only perform tokenization and dataset creation for the initial classifier iteration, prior to training with SageMaker.

### Loading pre-trained model and tokenizer
The idea behind using DistilBERT is that it is a lighter-weight model than BERT and similar models. This classifier will serve as the initial check on all messages in the Discord server, so it does not have to be perfect. In fact, I will readily accept lower recall in exchange for higher precision. To mitigate the annoyance of being mistakenly flagged as inappropriate, though, an appeal feature will be added to the bot that reruns the flagged messages through a heavier, more capable model.
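
The precision/recall trade-off mentioned above can be made concrete with a small sketch. The labels, scores, and thresholds here are made-up illustrative values, not results from this model, and `precision_recall` is a hypothetical helper rather than part of the notebook.

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are flagged."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 1 = message breaks a rule, 0 = message is fine
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.55, 0.60, 0.30, 0.10, 0.40, 0.70]

for threshold in (0.5, 0.75):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

With these toy numbers, raising the threshold flags fewer messages: fewer users are flagged by mistake (higher precision), but more rule-breaking messages slip through (lower recall).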

In [4]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# Base checkpoint; a new classification head is attached on top of it
model_path = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

# Binary head: class 0 is named "NEG", class 1 is named "POS"
model = DistilBertForSequenceClassification.from_pretrained(
    model_path,
    id2label={0: "NEG", 1: "POS"},
    label2id={"NEG": 0, "POS": 1},
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
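
The `id2label` mapping passed to the model is what turns its two output logits into readable labels. Below is a minimal pure-Python sketch of that step. The logits are hypothetical values chosen for illustration (with the real model they would come from `model(**inputs).logits`), and `softmax` and `predict_label` are helper names introduced here.

```python
import math

id2label = {0: "NEG", 1: "POS"}  # same mapping as in the cell above


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits for a single message, for illustration only
label, prob = predict_label([-1.2, 2.3])
print(label, round(prob, 3))
```

As the warning above indicates, the classification head is randomly initialized at this point, so labels produced before fine-tuning carry no meaning.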


## Load Wikipedia Toxic Comments Dataset

In [5]:
import pandas as pd

train = pd.read_csv('processed_train.csv')
test = pd.read_csv('processed_test.csv')
val = pd.read_csv('processed_val.csv')

print(f'Train shape: {train.shape}')
print(f'Test shape: {test.shape}')
print(f'Val shape: {val.shape}')

Train shape: (110000, 2)
Test shape: (25000, 2)
Val shape: (24571, 2)
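
As a quick arithmetic check, the split proportions follow directly from the shapes printed above. This sketch only restates those three row counts; it does not read the CSV files.

```python
# Row counts copied from the printed shapes above
split_sizes = {"train": 110_000, "test": 25_000, "val": 24_571}

total = sum(split_sizes.values())
fractions = {name: n / total for name, n in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {split_sizes[name]} rows ({frac:.1%})")
print(f"total: {total} rows")
```

This works out to roughly a 69/16/15 train/test/validation split.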


In [6]:
train.head()

|   | comment_text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


## Encoding with the DistilBERT Tokenizer

In [7]:
import pandas as pd
import torch
from transformers import DistilBertTokenizerFast

# Load the tokenizer
model_path = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)

def encode(comment, label):
    encoded = tokenizer(comment, truncation=True, padding='max_length', max_length=128, return_tensors="pt")
    
    # Drop the batch dimension and convert the tensors to plain Python lists
    attention_mask = encoded['attention_mask'][0].tolist()
    input_ids = encoded['input_ids'][0].tolist()
    # Labels may arrive as tensors; unwrap them to a plain Python number
    label = label.item() if isinstance(label, torch.Tensor) else label
    
    # Return data in a dictionary format
    return {
        'attention_mask': attention_mask,
        'input_ids': input_ids,
        'label': label,
        'text': comment
    }

def transform_to_dataframe(df):
    # Apply the encode function to each row of the dataframe
    encoded_data = df.apply(lambda row: encode(row['comment_text'], row['toxic']), axis=1)
    
    # Convert encoded data to a list of dictionaries
    list_of_dicts = [item for item in encoded_data]
    
    # Convert list of dictionaries to dataframe
    return pd.DataFrame(list_of_dicts)

# Assuming `train`, `test`, and `val` dataframes are defined and loaded
train_df = transform_to_dataframe(train)
test_df = transform_to_dataframe(test)
val_df = transform_to_dataframe(val)
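
The cell above calls the tokenizer once per row. Hugging Face tokenizers also accept a list of strings, so a batched variant is one possible way to speed this up. The sketch below is an assumption-laden illustration, not the notebook's method: `encode_in_batches` is a hypothetical helper, `tokenize_fn` stands for any callable that maps a list of strings to a dict with `input_ids` and `attention_mask` lists, and `fake_tokenize` is a stand-in so the example runs without downloading a model.

```python
def encode_in_batches(texts, labels, tokenize_fn, batch_size=1000):
    """Tokenize texts in chunks and return one record (dict) per text."""
    records = []
    for start in range(0, len(texts), batch_size):
        batch_texts = texts[start:start + batch_size]
        batch_labels = labels[start:start + batch_size]
        encoded = tokenize_fn(batch_texts)
        for ids, mask, label, text in zip(
            encoded["input_ids"], encoded["attention_mask"], batch_labels, batch_texts
        ):
            records.append({
                "attention_mask": list(mask),
                "input_ids": list(ids),
                "label": int(label),
                "text": text,
            })
    return records


def fake_tokenize(batch, max_length=8):
    """Stand-in tokenizer: one fake id per word (its length), padded with 0."""
    input_ids, attention_mask = [], []
    for text in batch:
        ids = [len(word) for word in text.split()][:max_length]
        pad = max_length - len(ids)
        input_ids.append(ids + [0] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


records = encode_in_batches(
    ["hello there", "spam spam spam"], [0, 1], fake_tokenize, batch_size=1
)
print(records[1]["input_ids"])
```

In the notebook, `tokenize_fn` could plausibly be `lambda batch: tokenizer(batch, truncation=True, padding='max_length', max_length=128)`, with `texts` taken from `train['comment_text'].tolist()` and the result passed to `pd.DataFrame(records)`.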


In [8]:
train_df.head()

|   | attention_mask | input_ids | label | text |
|---|---|---|---|---|
| 0 | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [101, 7526, 2339, 1996, 10086, 2015, 2081, 210... | 0 | Explanation\nWhy the edits made under my usern... |
| 1 | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [101, 1040, 1005, 22091, 2860, 999, 2002, 3503... | 0 | D'aww! He matches this background colour I'm s... |
| 2 | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [101, 4931, 2158, 1010, 1045, 1005, 1049, 2428... | 0 | Hey man, I'm really not trying to edit war. It... |
| 3 | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [101, 1000, 2062, 1045, 2064, 1005, 1056, 2191... | 0 | "\nMore\nI can't make any real suggestions on ... |
| 4 | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [101, 2017, 1010, 2909, 1010, 2024, 2026, 5394... | 0 | You, sir, are my hero. Any chance you remember... |
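
Every row above holds fixed-length `input_ids` and `attention_mask` lists. The sketch below imitates, in plain Python, what `truncation=True` and `padding='max_length'` do to a token sequence. It is a simplification of the tokenizer's behavior, and `pad_and_truncate` is a hypothetical helper, not a library function. The special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the uncased BERT vocabulary, which matches the leading 101 in each row above.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_and_truncate(token_ids, max_length=128):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    # Reserve two positions for [CLS] and [SEP], cutting the text if needed
    body = token_ids[:max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1, padding positions get mask 0
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * pad, attention_mask + [0] * pad


ids, mask = pad_and_truncate([7526, 2339, 1996], max_length=8)
print(ids)   # [101, 7526, 2339, 1996, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask tells the model which positions are real tokens, so the padding added to reach 128 positions does not influence the prediction.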


### Converting pandas dataframes to datasets.Dataset objects for training

In [10]:
import datasets

train_dataset = datasets.Dataset.from_pandas(train_df)
test_dataset = datasets.Dataset.from_pandas(test_df)
val_dataset = datasets.Dataset.from_pandas(val_df)

In [11]:
train_dataset.save_to_disk("train_dataset")
test_dataset.save_to_disk("test_dataset")
val_dataset.save_to_disk("val_dataset")

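
Before handing the saved datasets to a training job, it can be worth confirming that every record has the expected shape. The following is a minimal sketch under stated assumptions: `validate_records` is a hypothetical helper, the expected length of 128 comes from the `max_length` used during encoding, and the sample records are made up. It works on any iterable of dicts, which should include the rows of a dataset reloaded with `datasets.load_from_disk("train_dataset")`.

```python
def validate_records(records, max_length=128, allowed_labels=(0, 1)):
    """Return a list of (index, problem) tuples; an empty list means no issues."""
    problems = []
    for i, rec in enumerate(records):
        if len(rec["input_ids"]) != max_length:
            problems.append((i, "input_ids length"))
        if len(rec["attention_mask"]) != max_length:
            problems.append((i, "attention_mask length"))
        if rec["label"] not in allowed_labels:
            problems.append((i, "unexpected label"))
    return problems


# Made-up records: the first is well-formed, the second is deliberately broken
sample = [
    {"input_ids": [101, 102] + [0] * 126, "attention_mask": [1, 1] + [0] * 126, "label": 0},
    {"input_ids": [101, 102], "attention_mask": [1, 1], "label": 2},
]
print(validate_records(sample))
```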