Load the Dataset: Load the .pkl file containing the preprocessed tweets into a pandas DataFrame. The existing tokens column is dropped because the tweets are re-tokenized with a custom tokenizer in the steps below.

In [1]:
from tokenizers import BertWordPieceTokenizer
import pandas as pd

In [12]:
df = pd.read_pickle('../preprocessed/tweets_bert.pkl')
df = df.drop(columns=['tokens'])

df.head()

Unnamed: 0,tweet,label
0,as a woman you shouldnt complain about cleanin...,0
1,boy dats coldtyga dwn bad for cuffin dat hoe i...,0
2,dawg you ever fuck a bitch and she sta to cry ...,0
3,she look like a tranny,0
4,the shit you hear about me might be true or it...,0


Pre-processing: Apply any cleaning and normalization the tweets still need, and handle missing values. The tokenizer trainer reads plain-text files, so the tweets are then written to a text file, one tweet per line.

In [4]:
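# Write tweets to a text file, one per line
with open("../preprocessed/bert_formatted_data.txt", "w", encoding="utf-8") as file:
    for tweet in df["tweet"].dropna():  # skip missing values
        # Collapse embedded newlines and repeated spaces so each tweet stays on one line
        file.write(" ".join(str(tweet).split()) + "\n")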
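
The cell above leaves the wording of each tweet untouched. If the raw text still needed cleaning, a normalization pass could look roughly like the sketch below. The function name `normalize_tweet` and the specific rules (dropping URLs and @mentions, lowercasing, collapsing whitespace) are illustrative assumptions, not the steps that were used to produce tweets_bert.pkl.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def normalize_tweet(text):
    """Return a cleaned single-line tweet, or None for missing or empty input."""
    if not isinstance(text, str):
        return None  # covers None and NaN values
    text = URL_RE.sub(" ", text)      # drop URLs
    text = MENTION_RE.sub(" ", text)  # drop @mentions
    text = text.lower()
    text = " ".join(text.split())     # collapse whitespace, including newlines
    return text or None

raw = ["Check THIS out https://example.com  @someone", None, "  two\nlines  "]
cleaned = [t for t in map(normalize_tweet, raw) if t is not None]
print(cleaned)
```

With a DataFrame, the same function could be applied as `df["tweet"].map(normalize_tweet)` followed by `dropna()`.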

Train Custom BERT Tokenizer: Use the pre-processed dataset to train a custom BERT tokenizer specifically tailored for your hate speech classification task. This tokenizer will convert the tweets into BERT-compatible tokens.

In [8]:
import os
from tokenizers import BertWordPieceTokenizer

# Initialize the tokenizer with desired parameters
tokenizer = BertWordPieceTokenizer(
    clean_text=True,  # Clean text before tokenization
    handle_chinese_chars=True,
    strip_accents=False,
    lowercase=False,  # Keep case-sensitive
    wordpieces_prefix="##"
)

# Train the tokenizer
tokenizer.train(
    files=["../preprocessed/bert_formatted_data.txt"],  # Path to formatted data
    vocab_size=30000,  # Vocabulary size
    min_frequency=2,   # Minimum frequency to include a token in vocabulary
    limit_alphabet=1000,  # Limit alphabet characters during training
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # Special tokens to include
)

save_dir = os.path.join("..", "custom_bert_tokenizer")  # portable across operating systems
os.makedirs(save_dir, exist_ok=True)

# Save the trained tokenizer
tokenizer.save_model(save_dir)

['..\\custom_bert_tokenizer\\vocab.txt']

Verify the Tokenizer: Load the saved vocabulary with BertTokenizer and tokenize a sample tweet to check that it produces BERT-compatible tokens.

In [10]:
from transformers import BertTokenizer

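# Note: BertTokenizer lowercases input by default (do_lower_case=True), although the
# tokenizer was trained with lowercase=False. Pass do_lower_case=False to preserve case.
tokenizer = BertTokenizer.from_pretrained("..\\custom_bert_tokenizer")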

# Test the tokenizer on a sample tweet
sample_tweet = "This is a sample tweet for testing."
tokenized_input = tokenizer.tokenize(sample_tweet)
print("Tokenized input:", tokenized_input)

Tokenized input: ['this', 'is', 'a', 'sample', 'tweet', 'for', 'testing', '[UNK]']
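
The trailing `[UNK]` in the output stands in for the final period. To illustrate how a word ends up as one token, several pieces, or `[UNK]`, here is a simplified sketch of the greedy longest-match-first rule that WordPiece applies to each word. The toy vocabulary is hypothetical; the real one is the vocab.txt saved above. If the period never made it into that vocabulary (plausible, since the sample tweets shown earlier contain no punctuation), it is mapped to `[UNK]` exactly as in the last call below.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"test", "##ing", "tweet", "##s"}  # hypothetical vocabulary
print(wordpiece("testing", toy_vocab))
print(wordpiece("tweets", toy_vocab))
print(wordpiece(".", toy_vocab))
```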


Tokenize the Dataset: Convert the pre-processed tweets into tokens using the custom tokenizer and save the result.

In [14]:
import os
import pandas as pd
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("..\\custom_bert_tokenizer")

# Tokenization function
def tokenize_text(text):
    return tokenizer.tokenize(text)

df["tokenized_text"] = df["tweet"].apply(tokenize_text)

# Save tokenized data, creating the output directory if it does not exist
out_dir = "..\\tokenized_dataset"
os.makedirs(out_dir, exist_ok=True)
df.to_csv(os.path.join(out_dir, "tokenized_data.csv"), index=False)
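
One thing to watch when reading this file back: a CSV cell can only hold text, so each token list is stored as its string representation. The sketch below reproduces that round trip with the standard library only (an in-memory buffer stands in for the file, and the row content is made up) and shows one way to recover the list with `ast.literal_eval`.

```python
import ast
import csv
import io

# Write one row whose second cell is a Python list; it is stored via str()
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["tweet", "tokenized_text"])
writer.writerow(["a sample tweet", ["a", "sample", "tweet"]])

# Read it back: the cell is now a string that only looks like a list
buffer.seek(0)
rows = list(csv.DictReader(buffer))
stored = rows[0]["tokenized_text"]
tokens = ast.literal_eval(stored)  # safely parse the string into a list again
print(type(stored).__name__, tokens)
```

With pandas, the same conversion can be requested at load time, for example `pd.read_csv(path, converters={"tokenized_text": ast.literal_eval})`.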

Define the Model Configuration: Define the parameters for the custom BERT model, including the vocabulary size, hidden size, number of layers, number of attention heads, and maximum position embeddings.
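
As a rough illustration of how these parameters interact, the sketch below collects one possible set of values and estimates the resulting encoder size. The numbers are hypothetical placeholders for a small model, not recommended settings; the key names follow the keyword arguments of `transformers.BertConfig`, so the dictionary could be passed to it as `BertConfig(**bert_params)`. The count covers the embeddings, encoder layers, and pooler, and excludes any classification head.

```python
# Hypothetical small configuration; key names follow transformers.BertConfig
bert_params = {
    "vocab_size": 30000,             # size requested for the tokenizer; use the real vocab.txt length
    "hidden_size": 256,
    "num_hidden_layers": 4,
    "num_attention_heads": 4,
    "intermediate_size": 1024,       # width of the feed-forward sublayer
    "max_position_embeddings": 128,  # longest sequence the model can encode
}

def count_bert_parameters(p, type_vocab_size=2):
    """Estimate the trainable parameters of a BERT encoder with a pooler."""
    h, i = p["hidden_size"], p["intermediate_size"]
    if h % p["num_attention_heads"] != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    layer_norm = 2 * h  # scale and bias
    embeddings = (p["vocab_size"] + p["max_position_embeddings"] + type_vocab_size) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # query, key, value, output projections
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + p["num_hidden_layers"] * (attention + feed_forward) + pooler

print(f"{count_bert_parameters(bert_params):,} parameters")
```

With a vocabulary this large relative to the hidden size, most of the parameters sit in the token embedding table, which is one reason to check the actual vocabulary size before fixing the configuration.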

Define Training Arguments: Specify the training arguments for training the custom BERT model, such as the output directory, number of epochs, batch size, and evaluation settings.
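
To make the effect of epochs and batch size concrete, the sketch below derives the number of optimizer steps from them. All values are hypothetical, including the output directory and the dataset size of 20,000 tweets; the key names mirror fields of `transformers.TrainingArguments`. The calculation assumes a single device and no gradient accumulation.

```python
import math

# Hypothetical values; key names mirror transformers.TrainingArguments fields
training_params = {
    "output_dir": "../custom_bert_model",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 64,
    "learning_rate": 5e-5,
    "weight_decay": 0.01,
}

def training_schedule(num_examples, params):
    """Return (steps per epoch, total steps) for one device, keeping the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / params["per_device_train_batch_size"])
    return steps_per_epoch, steps_per_epoch * params["num_train_epochs"]

print(training_schedule(20000, training_params))
```

Knowing the total step count helps when choosing logging, evaluation, and checkpoint intervals.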

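Train the BERT Model: Train the custom BERT model using the specified training arguments. The model takes integer input IDs and an attention mask rather than token strings, so the tokenized tweets must first be converted to IDs and padded or truncated to a fixed length.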
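
The conversion from token strings to model inputs can be sketched as follows. The toy vocabulary and the `encode` helper are hypothetical; the special tokens are given IDs 0 to 4 in the order they were listed when training the tokenizer, and the maximum length of 8 is chosen only to keep the example short.

```python
def encode(tokens, vocab, max_len=8):
    """Map tokens to a fixed-length list of IDs plus an attention mask."""
    unk_id = vocab["[UNK]"]
    body = tokens[: max_len - 2]  # leave room for [CLS] and [SEP]
    ids = [vocab["[CLS]"]] + [vocab.get(t, unk_id) for t in body] + [vocab["[SEP]"]]
    mask = [1] * len(ids)         # 1 marks real tokens
    padding = max_len - len(ids)
    return ids + [vocab["[PAD]"]] * padding, mask + [0] * padding  # 0 marks padding

toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
             "this": 5, "is": 6, "a": 7, "tweet": 8}
input_ids, attention_mask = encode(["this", "is", "a", "tweet", "."], toy_vocab)
print(input_ids)
print(attention_mask)
```

In practice the loaded tokenizer can do this in one call, for example `tokenizer(text, padding="max_length", truncation=True, max_length=128)`, which returns `input_ids` and `attention_mask`.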

Evaluate the Model: Evaluate the trained model on a held-out validation or test set, using metrics such as accuracy, precision, recall, and F1 to assess how well it classifies hate speech.
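
A minimal sketch of such an evaluation is shown below, computing the metrics by hand for a binary task. It assumes label 1 marks hate speech, which the source does not state, and the labels and predictions are made-up values. The example uses a model that always predicts 0 to show why accuracy alone can mislead on an imbalanced dataset.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # made-up labels, 20% positive
always_negative = [0] * 10                # a model that never flags anything
print(binary_metrics(y_true, always_negative))
```

The always-negative model is right on most examples yet finds none of the positive ones, which recall and F1 expose.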

Model Inference: Once the model is trained and evaluated, use it to predict whether new tweets contain hate speech.
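
A classification model typically returns one raw score (logit) per class for each tweet. The sketch below shows how such scores could be turned into a probability and a label. The logits are made-up numbers, the helper names are hypothetical, and treating index 1 as the hate speech class is an assumption.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, threshold=0.5, positive_index=1):
    """Return (label, probability of the positive class)."""
    prob = softmax(logits)[positive_index]
    return int(prob >= threshold), prob

label, prob = predict([0.2, 1.4])  # made-up logits for a single tweet
print(label, round(prob, 3))
```

Raising the threshold trades recall for precision, which may matter if false accusations of hate speech are costly.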