# Tokenizer Training for NER labeling job

I ran into an issue with the tokenizer I was using when I tried to label the Wikipedia toxic comments dataset with the fine-tuned NER model. The tokenizer had not been exposed to the inappropriate words found in the Wikipedia dataset, so the BertTokenizerFast('bert-base-uncased') tokenizer I was using would subword tokenize entities it was not familiar with. This posed a major problem and rendered the labeling process ineffective.

To address this issue, in this notebook I train two new tokenizers on a combined corpus of the conll2003 dataset and the Wikipedia toxic comments dataset: one that includes the entire conll2003 and wiki datasets, and another that includes the entire conll2003 dataset and only the toxic examples from the wiki set. These tokenizers will be used to fine-tune the base distilbert model.

I must fine-tune the entire model again because using a different tokenizer majorly impacts the performance of the model: the input ids, the numbers the model receives as input, will be entirely different.
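
To make the point about input ids concrete, here is a minimal sketch with two toy vocabularies. Both vocabularies and the `encode` helper are invented for illustration; they are not the real BERT vocabulary or the ones trained in this notebook.

```python
# Toy illustration: the same tokens receive different IDs under two vocabularies.
# Both vocabularies are made up for this example.
old_vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "edit": 3, "war": 4}
new_vocab = {"[PAD]": 0, "[UNK]": 1, "war": 2, "the": 3, "hero": 4, "edit": 5}


def encode(tokens, vocab):
    """Map each token to its ID, falling back to the [UNK] ID."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = ["the", "edit", "war"]
old_ids = encode(tokens, old_vocab)
new_ids = encode(tokens, new_vocab)

print(old_ids)  # [2, 3, 4]
print(new_ids)  # [3, 5, 2]
```

A model fine-tuned with the first vocabulary has learned that ID 2 means "the"; under the second vocabulary the same ID stands for "war", so the previously learned associations no longer line up.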

# Tokenizer Training Strategy
The first step is to gather the text data from conll2003 and the wiki set into a single corpus.txt file.
From there the training is simple and can be done in a few lines of code.

In [72]:
import datasets

conll2003 = datasets.load_dataset("conll2003")

Here I am converting the train, validation, and test sets to pandas DataFrames for easier manipulation.

In [73]:
import pandas

#extract the train, validation, and test splits --- each one is a datasets.Dataset
train, validation, test = conll2003["train"], conll2003["validation"], conll2003["test"]

#convert the datasets to pandas dataframes
train, validation, test = [x.to_pandas() for x in [train, validation, test]]

train.head()

Unnamed: 0,id,tokens,pos_tags,chunk_tags,ner_tags
0,0,"[EU, rejects, German, call, to, boycott, Briti...","[22, 42, 16, 21, 35, 37, 16, 21, 7]","[11, 21, 11, 12, 21, 22, 11, 12, 0]","[3, 0, 7, 0, 0, 0, 7, 0, 0]"
1,1,"[Peter, Blackburn]","[22, 22]","[11, 12]","[1, 2]"
2,2,"[BRUSSELS, 1996-08-22]","[22, 11]","[11, 12]","[5, 0]"
3,3,"[The, European, Commission, said, on, Thursday...","[12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 3...","[11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 1...","[0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, ..."
4,4,"[Germany, 's, representative, to, the, Europea...","[22, 27, 21, 35, 12, 22, 22, 27, 16, 21, 22, 2...","[11, 11, 12, 13, 11, 12, 12, 11, 12, 12, 12, 1...","[5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, ..."


Next I am converting the df into a list of white space split tokens

In [74]:
#this results in a list of numpy arrays of tokens: [np.array(['a', 'b']), np.array(['c', 'd'])]
train_tokens, test_tokens, val_tokens = [x["tokens"].to_list() for x in [train, test, validation]]

#map conversion of token examples to a list of lists to all datasets --- result: [['a', 'b'], ['c', 'd']]
train_tokens, test_tokens, val_tokens = [list(map(list, x)) for x in [train_tokens, test_tokens, val_tokens]]

And finally I am concatenating them to a single list

In [75]:
# move all lists into one list
all_tokens_conll2003 = train_tokens + test_tokens + val_tokens

Now I am going to do the same thing with the Wikipedia comments dataset.

In [76]:
import os

dir_path = r"C:\Users\hunte\OneDrive\Documents\Coding Projects\Bot_Discord_Proj\Original_Folder\data"
paths = [os.path.join(dir_path, data_split) for data_split in ["train.csv", "test.csv"]] #paths to train.csv and test.csv

train, test = [pandas.read_csv(path) for path in paths]

train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


Below I am converting the pandas dfs for both wiki splits (train and test) into lists of whitespace-separated tokens.

In [116]:
#extract "comment_text" out from df and convert to list of strings
train_text, test_text = [x["comment_text"].to_list() for x in [train, test]]

#map conversions of lists of strings ----> lists of white space split tokens
train_tokens, test_tokens = [list(map(str.split, x)) for x in [train_text, test_text]]

#concatenate the train and test tokens
all_tokens_wiki = train_tokens + test_tokens

print(f'number of examples from conll2003: {len(all_tokens_conll2003)}')
print(f'number of examples from wiki set: {len(all_tokens_wiki)}\n')

#sum lengths of examples across datasets
non_unique_tokens_conll2003 = sum([len(example) for example in all_tokens_conll2003])
non_unique_tokens_wiki = sum([len(example) for example in all_tokens_wiki])

print(f'number of non-unique tokens from conll2003: {non_unique_tokens_conll2003}')
print(f'number of non-unique tokens from wiki set: {non_unique_tokens_wiki}\n')
print(f'total non-unique tokens: {non_unique_tokens_conll2003 + non_unique_tokens_wiki:,}')

number of examples from conll2003: 20744
number of examples from wiki set: 312735

number of non-unique tokens from conll2003: 301418
number of non-unique tokens from wiki set: 20171453

total non-unique tokens: 20,472,871


Below I am combining the two lists of tokens together into a single list of lists of tokens

In [120]:
#concatenate the two datasets
all_tokens = all_tokens_conll2003 + all_tokens_wiki

In [121]:
len(all_tokens)

333479

To determine the number of unique tokens I am going to convert all_tokens to a pandas series and apply .unique() to it.

In [127]:
# Flatten the list of lists
tokens_list = [token for example in all_tokens for token in example]

# Convert the flattened list to a Series
s = pandas.Series(tokens_list)  # pandas was imported above without the pd alias

# Find unique values
unique_values = s.unique()

len(unique_values)

972712

There are nearly 1,000,000 unique tokens. During training, the tokenizer decides which words and subwords to include in the vocabulary based on frequency. I am a bit concerned that some toxic words will be subword tokenized or mapped to the [UNK] token.

I will keep this in mind. I am thinking that I will also create a corpus that puts greater emphasis on the toxic vocab from the wiki set by forgoing the clean examples.

Below I am going to write all_tokens to a corpus text file with one example per line.
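
One rough way to gauge that risk before training is to count how often specific words appear in the corpus. The sketch below is only a proxy under stated assumptions: `frequency_report` and `sample_tokens` are hypothetical names, the sample data stands in for `all_tokens`, and tokens are lowercased on the assumption that the tokenizer lowercases too. The real tokenizer also splits on punctuation, so whitespace counts will not match its internal counts exactly.

```python
from collections import Counter


def frequency_report(examples, words, min_frequency=2):
    """Count whitespace tokens and report which words meet min_frequency."""
    counts = Counter(token.lower() for example in examples for token in example)
    return {word: (counts[word], counts[word] >= min_frequency) for word in words}


# Hypothetical stand-in for all_tokens (a list of lists of tokens).
sample_tokens = [
    ["You", "are", "a", "jerk"],
    ["What", "a", "jerk", "move"],
    ["Total", "nincompoop"],
]

report = frequency_report(sample_tokens, ["jerk", "nincompoop", "a"])
print(report)
# {'jerk': (2, True), 'nincompoop': (1, False), 'a': (2, True)}
```

Words that fall below the threshold are the ones most likely to end up split into subwords.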

In [137]:
import os
os.getcwd()

'c:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Bot_Discord_Proj\\NER_model'

In [141]:
with open("conll_wiki_corpus.txt", "wb") as f:
    for i in range(len(all_tokens)):
        #encode joined tokens to utf-8 and write to file
        example = (" ".join(all_tokens[i]) + "\n").encode("utf-8")
        f.write(example)



# Train Tokenizer with Corpus

The tokenizer training process determines which tokens from the corpus are included whole in the vocabulary and which are split into subwords.
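
For intuition about what "split into subwords" means once a vocabulary is fixed, here is a minimal sketch of greedy longest-match-first WordPiece splitting. It illustrates tokenization with an existing vocabulary, not the training algorithm, and `wordpiece_tokenize` and `toy_vocab` are made up for this example rather than taken from the tokenizers library.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"edit", "war", "##ring", "##s", "un", "##edit", "##ed"}

print(wordpiece_tokenize("edit", toy_vocab))      # ['edit']
print(wordpiece_tokenize("wars", toy_vocab))      # ['war', '##s']
print(wordpiece_tokenize("unedited", toy_vocab))  # ['un', '##edit', '##ed']
print(wordpiece_tokenize("xyz", toy_vocab))       # ['[UNK]']
```

A word that is in the vocabulary stays whole, which is the outcome I want for entity and toxic words; everything else is broken into pieces or falls back to [UNK].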

In [147]:
from tokenizers import BertWordPieceTokenizer

# Initialize a tokenizer
tokenizer = BertWordPieceTokenizer()

#path to corpus 
corpus_path = os.path.join(os.getcwd(), "conll_wiki_corpus.txt")

# Train the tokenizer
tokenizer.train(
    files=[corpus_path],  
    vocab_size=30000,  # 30,000 is a standard value for BERT
    min_frequency=2,  # Minimum frequency for a token to be included in vocab
    show_progress=True,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    limit_alphabet=1000,  # Limit the alphabet size
    wordpieces_prefix="##"  # Prefix for subwords
)

# Save the tokenizer
tokenizer.save_model(os.getcwd(), "conll_wiki_tokenizer")

['c:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Bot_Discord_Proj\\NER_model\\conll_wiki_tokenizer-vocab.txt']

# Alternative Tokenizer
I will also quickly run through the entire process of constructing the corpus and training the tokenizer again, but this time without the non-toxic examples from the wiki dataset. This is to address my concern that the toxic words from the wiki set will be lost to subword tokenization, crowded out of the vocabulary by more frequent friendly words from that set. I will A/B test this concern by fine-tuning a model with each tokenizer.

I won't explain the code below beyond the comments since it is the same process as above.

In [149]:
len(all_tokens_conll2003)

20744

Build Corpus

In [165]:
#load wiki set
dir_path = r"C:\Users\hunte\OneDrive\Documents\Coding Projects\Bot_Discord_Proj\Original_Folder\data"
path = os.path.join(dir_path, "train.csv") #path to train.csv

#only train contains labels
train = pandas.read_csv(path)

toxic_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

#if any of the labels is 1, the example is toxic; keep those rows in a new dataframe
tox_df = train[train[toxic_cols].any(axis=1)]

#extract "comment_text" out from df and convert to list of strings
train_text = tox_df["comment_text"].to_list()

#map conversions of lists of strings ----> lists of white space split tokens
train_tokens = list(map(str.split, train_text))

#concatenate the two datasets
all_tokens = all_tokens_conll2003 + train_tokens
print(f'Examples in all_tokens: {len(all_tokens)}')

#write corpus to file
with open("conll_toxic_wiki_corpus.txt", "wb") as f:
    for i in range(len(all_tokens)):
        #encode joined tokens to utf-8 and write to file
        example = (" ".join(all_tokens[i]) + "\n").encode("utf-8")
        f.write(example)


Examples in all_tokens: 36969


Train Tokenizer

In [166]:
from tokenizers import BertWordPieceTokenizer

# Initialize a tokenizer
tokenizer = BertWordPieceTokenizer()

#path to corpus 
corpus_path = os.path.join(os.getcwd(), "conll_toxic_wiki_corpus.txt")

# Train the tokenizer
tokenizer.train(
    files=[corpus_path],  
    vocab_size=30000,  # 30,000 is a standard value for BERT
    min_frequency=2,  # Minimum frequency for a token to be included in vocab
    show_progress=True,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    limit_alphabet=1000,  # Limit the alphabet size
    wordpieces_prefix="##"  # Prefix for subwords
)

# Save the tokenizer
tokenizer.save_model(os.getcwd(), "conll_toxic_wiki_tokenizer")

['c:\\Users\\hunte\\OneDrive\\Documents\\Coding Projects\\Bot_Discord_Proj\\NER_model\\conll_toxic_wiki_tokenizer-vocab.txt']
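
With both vocab files saved, a quick sanity check before fine-tuning is to see which words of interest made it into each vocabulary as whole tokens. This is a minimal sketch: `load_vocab`, `whole_word_hits`, the toy vocabularies, and the probe words are all hypothetical, and the lowercasing assumes the vocabularies are uncased.

```python
def load_vocab(path):
    """Read a vocab file with one token per line into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def whole_word_hits(vocab, words):
    """Return the words that appear in the vocabulary as single whole tokens."""
    return [word for word in words if word.lower() in vocab]


# Toy vocabularies standing in for the two saved vocab files.
full_vocab = {"[UNK]", "the", "edit", "##s", "jerk"}
toxic_vocab = {"[UNK]", "the", "jerk", "moron", "##s"}
probe_words = ["jerk", "moron", "edit"]

print(whole_word_hits(full_vocab, probe_words))   # ['jerk', 'edit']
print(whole_word_hits(toxic_vocab, probe_words))  # ['jerk', 'moron']
```

On the real files, `load_vocab("conll_wiki_tokenizer-vocab.txt")` and `load_vocab("conll_toxic_wiki_tokenizer-vocab.txt")` would replace the toy sets, with the probe list drawn from the words I care about labeling.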