Load the Dataset: Load the Mendeley hate speech dataset (a CSV file) into a pandas DataFrame.

In [1]:
from tokenizers import BertWordPieceTokenizer
import pandas as pd

In [3]:
df = pd.read_csv("../../data/mendeley/HateSpeechDatasetBalanced.csv")
df.columns = df.columns.str.lower()

df.head()

Unnamed: 0,content,label
0,denial of normal the con be asked to comment o...,1
1,just by being able to tweet this insufferable ...,1
2,that is retarded you too cute to be single tha...,1
3,thought of a real badass mongol style declarat...,1
4,afro american basho,1


Pre-processing: Clean and normalize the tweets as needed, and handle any missing values or outliers in the dataset. The cell below writes one tweet per line to a plain-text file, which is the input format used for tokenizer training.

In [6]:
# Write one tweet per line to a text file, skipping missing values
with open("../../preprocessed/mendeley_bert_formatted_data.txt", "w", encoding="utf-8") as file:
    for tweet in df["content"].dropna().astype(str):
        file.write(tweet + "\n")
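
The cleaning and missing-value handling described above could look roughly like the following. This is a minimal sketch, not part of the original pipeline: the helper `clean_tweet` and its rules (dropping URLs, collapsing whitespace) are hypothetical choices made for illustration, and the sample strings are invented.

```python
import math
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
WHITESPACE_RE = re.compile(r"\s+")

def clean_tweet(text):
    """Return a single-line, whitespace-normalized tweet, or None if missing or empty."""
    # pandas represents missing values as float NaN; treat those and None as missing
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return None
    text = URL_RE.sub(" ", str(text))            # drop URLs (hypothetical cleaning rule)
    text = WHITESPACE_RE.sub(" ", text).strip()  # collapse newlines, tabs, repeated spaces
    return text or None

# Invented sample input: padded text, a missing value, a URL, and an empty string
raw_tweets = ["  hello   world\n", float("nan"), "see https://example.com now", ""]
cleaned = [t for t in (clean_tweet(r) for r in raw_tweets) if t is not None]
print(cleaned)  # ['hello world', 'see now']
```

Collapsing whitespace also removes embedded newlines, which matters here because the training file stores exactly one tweet per line.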

Train Custom BERT Tokenizer: Use the pre-processed dataset to train a custom BERT tokenizer specifically tailored for your hate speech classification task. This tokenizer will convert the tweets into BERT-compatible tokens.

In [13]:
import os
from tokenizers import BertWordPieceTokenizer

# Initialize the tokenizer with desired parameters
tokenizer = BertWordPieceTokenizer(
    clean_text=True,  # Clean text before tokenization
    handle_chinese_chars=True,
    strip_accents=False,
    lowercase=False,  # Keep case-sensitive
    wordpieces_prefix="##"
)

# Train the tokenizer
tokenizer.train(
    files=["../../preprocessed/mendeley_bert_formatted_data.txt"],  # Path to formatted data
    vocab_size=50000,  # Vocabulary size
    min_frequency=2,   # Minimum frequency to include a token in vocabulary
    limit_alphabet=1000,  # Maximum number of distinct characters kept in the alphabet
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # Special tokens to include
)

save_dir = "../../custom_bert_tokenizer_mendeley"
os.makedirs(save_dir, exist_ok=True)

# Save the trained tokenizer
tokenizer.save_model(save_dir)

['../../custom_bert_tokenizer_mendeley\\vocab.txt']
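
To make the role of the saved vocabulary and the "##" prefix concrete, here is a simplified sketch of how a WordPiece vocabulary is applied to a single word: greedy longest-match-first, with non-initial pieces carrying the prefix. It is an illustration only, not the `tokenizers` library's implementation; the function `wordpiece_tokenize` and the toy vocabulary are made up for this example.

```python
def wordpiece_tokenize(word, vocab, prefix="##", unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (hypothetical); a trained vocab.txt holds one such entry per line
toy_vocab = {"test", "##ing", "tweet", "##s", "[UNK]"}
print(wordpiece_tokenize("testing", toy_vocab))  # ['test', '##ing']
print(wordpiece_tokenize("tweets", toy_vocab))   # ['tweet', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

A larger `vocab_size` means more whole words get their own entry and fewer are split into pieces.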

Load the custom tokenizer to ensure it correctly tokenizes the tweets into BERT-compatible tokens.

In [14]:
from transformers import BertTokenizer

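# Note: BertTokenizer defaults to do_lower_case=True, so input is lowercased here
# even though the tokenizer was trained with lowercase=False. Pass
# do_lower_case=False to from_pretrained to match the training setting.
tokenizer = BertTokenizer.from_pretrained("../../custom_bert_tokenizer_mendeley")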

# Test the tokenizer on a sample tweet
sample_tweet = "This is a sample bad tweet for testing."
tokenized_input = tokenizer.tokenize(sample_tweet)
print("Tokenized input:", tokenized_input)

Tokenized input: ['this', 'is', 'a', 'sample', 'bad', 'tweet', 'for', 'testing', '.']


Convert the pre-processed text data (tweets) into tokens using the custom tokenizer.

In [17]:
from transformers import BertTokenizer
import pandas as pd

tokenizer = BertTokenizer.from_pretrained("../../custom_bert_tokenizer_mendeley")

# Tokenization function
def tokenize_text(text):
    return tokenizer.tokenize(text)

df["tokenized_text"] = df["content"].apply(tokenize_text)

# Save tokenized data (each token list is written to the CSV as its string representation)
df.to_csv("../../tokenized_dataset/tokenized_data_mendeley.csv", index=False)
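
One caveat with saving token lists to CSV: a list-valued column is stored as text, so it comes back as a string rather than a list when the file is read again. A possible workaround, sketched below with the standard library only, is to serialize each list as JSON before writing and parse it after reading. The in-memory buffer and the column name `tokens` are assumptions for this example and are not part of the notebook.

```python
import csv
import io
import json

# One example row, reusing the sample tweet tokens shown earlier
rows = [
    {
        "content": "This is a sample bad tweet for testing.",
        "label": 1,
        "tokens": ["this", "is", "a", "sample", "bad", "tweet", "for", "testing", "."],
    },
]

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["content", "label", "tokens"])
writer.writeheader()
for row in rows:
    # Store the token list as a JSON string so it can be parsed back reliably
    writer.writerow({**row, "tokens": json.dumps(row["tokens"])})

buffer.seek(0)
restored = [
    {**row, "tokens": json.loads(row["tokens"])}
    for row in csv.DictReader(buffer)
]
print(restored[0]["tokens"] == rows[0]["tokens"])  # True
```

With pandas, the same idea would be to apply `json.dumps` to the column before `to_csv` and `json.loads` after `read_csv`.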