# Tokenizers of Transformers

Since the original transformer was introduced, the architecture has evolved quickly. With each iteration, the way these models process their input data has changed, so it is important to understand the preprocessing that each model requires.

One important aspect of this preprocessing is tokenization. Say we use a tokenizer trained on generic news articles to process an advanced science article, for example in biology. There is a high chance the tokenizer will handle the domain-specific terms we care about poorly. So it is important to choose the right type of tokenizer for our use case and dataset.

First, we will look at some preprocessing practices that have been found to be effective when training transformers.


1. Only choose sentences with punctuation marks. (To teach a machine our language, we first show it well-formed language.)
2. Remove bad words. (Offensive language in the training data can resurface in the model's output.)
3. Remove code. (This depends on the task, but in general removing numerical components is a good idea.)
4. Check the language. (In some cases a dataset may contain multiple languages in one section, which may cause problems.)
5. Check the grammar. (Some online datasets contain sentences that make no sense. It is better to filter out such data.)
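
As a rough illustration, the first three rules can be sketched as a simple sentence filter. The block list `BAD_WORDS`, the pattern `CODE_LIKE`, and the function `keep_sentence` are hypothetical names chosen for this example, and the heuristics are deliberately crude. Rules 4 and 5 are left out because they need dedicated language identification and grammar tools.

```python
import re

# Hypothetical block list; a real project would load a curated one.
BAD_WORDS = {"darn", "heck"}

# Assumption: digits and common code symbols signal code or numeric content.
CODE_LIKE = re.compile(r"[0-9{}<>=;]")


def keep_sentence(sentence):
    """Return True if the sentence passes rules 1 to 3 of the list above."""
    text = sentence.strip()

    # Rule 1: keep only sentences that end with sentence punctuation.
    if not text.endswith((".", "!", "?")):
        return False

    # Rule 2: drop sentences containing any block-listed word.
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & BAD_WORDS:
        return False

    # Rule 3: drop sentences that look like code or contain numbers.
    if CODE_LIKE.search(text):
        return False

    return True


samples = [
    "All men are created equal.",
    "click here for more",
    "That is a darn shame.",
    "x = y + 1;",
    "Signed in 1776.",
]
kept = [s for s in samples if keep_sentence(s)]
print(kept)
```

Only the first sample survives here. Whether dropping every sentence with a digit is acceptable depends on the task, as rule 3 notes.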


Beyond the above, in commercial applications it is better to check the data for discrimination, misinformation, and similar problems, and to remove them. Otherwise they may cause ethical and legal problems down the line.

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ddsdi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [3]:
import math
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action='ignore')

In [5]:
# Read the whole corpus; the context manager closes the file automatically
with open('data/text.txt', 'r', encoding='utf8') as myfile:
    raw_data = myfile.read()

In [14]:
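# Flatten line breaks so sent_tokenize sees continuous text
data = raw_data.replace('\n', ' ')

# Build a list of sentences, each a list of lowercased word tokens
sentences = []

for sent in sent_tokenize(data):
    words = []

    for word in word_tokenize(sent):
        words.append(word.lower())
    sentences.append(words)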

print(sentences[:3])

[['december', ',', '1971', '[', 'etext', '#', '1', ']', 'the', 'project', 'gutenberg', 'etext', 'of', 'the', 'declaration', 'of', 'independence', '.'], ['all', 'of', 'the', 'original', 'project', 'gutenberg', 'etexts', 'from', 'the', '1970', "'s", 'were', 'produced', 'in', 'all', 'caps', ',', 'no', 'lower', 'case', '.'], ['the', 'computers', 'we', 'used', 'then', 'did', "n't", 'have', 'lower', 'case', 'at', 'all', '.']]


In [16]:
model = gensim.models.Word2Vec(sentences, min_count=1, vector_size=512, window=5, sg=1)
print(model)

Word2Vec<vocab=11806, vector_size=512, alpha=0.025>


The model above is a Word2Vec model trained with skip-gram (`sg=1`). After training, each word has a 512-dimensional embedding.
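
To make the skip-gram idea more concrete, here is a minimal sketch of how (center, context) training pairs can be formed from a token list and a window size. The function `skipgram_pairs` is a hypothetical helper written for this illustration and is not part of gensim. Real implementations add further training details that are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Pair every token with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["the", "declaration", "of", "independence"]
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

With `window=1`, each word is paired only with its immediate neighbours. The model above uses `window=5`, so each word is paired with up to five words on either side.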

In [30]:
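def similarity(model, word1, word2):
    """Return the cosine similarity of two words, or None if either is OOV."""
    try:
        # reshape(1, -1) gives the 2D row vectors cosine_similarity expects,
        # without hard-coding the embedding size
        a = model.wv[word1].reshape(1, -1)
        b = model.wv[word2].reshape(1, -1)
    except KeyError:
        # Raised when a word is missing from the vocabulary
        print("OOV word!")
        return None

    return cosine_similarity(a, b)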

In [31]:
word1="freedom"
word2="liberty"

print("Similarity", similarity(model, word1, word2), word1, word2)

Similarity [[0.3650815]] freedom liberty


In [32]:
word1="cooperations"
word2="rights"

print("Similarity", similarity(model, word1, word2), word1, word2)

OOV word!
Similarity None cooperations rights


When a word is not in the vocabulary, some models cannot provide a result, as in the case above. This may cause a chain of issues if not handled properly.
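
One way to handle this more explicitly is to check membership first and report which words are missing, so that downstream code can decide what to do. The sketch below is only an illustration: `safe_similarity` and `toy_vectors` are hypothetical names, and a plain dictionary stands in for the trained embeddings. It assumes the vector store supports `in` and `[]`, which gensim's `model.wv` does.

```python
import math


def safe_similarity(vectors, word1, word2):
    """Return (score, missing_words); score is None if it cannot be computed."""
    missing = [w for w in (word1, word2) if w not in vectors]
    if missing:
        return None, missing

    a, b = vectors[word1], vectors[word2]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        # Cosine similarity is undefined for a zero vector
        return None, []
    return dot / norm, []


# Toy 2-dimensional vectors, made up for this example
toy_vectors = {"freedom": [1.0, 0.0], "liberty": [1.0, 1.0]}

score, missing = safe_similarity(toy_vectors, "freedom", "liberty")
print(score, missing)

oov_score, oov_missing = safe_similarity(toy_vectors, "cooperations", "rights")
print(oov_score, oov_missing)
```

Returning the list of missing words, rather than only printing a message, lets the caller log them or fall back to another strategy.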

On the other hand, if we use a subword tokenizer such as byte pair encoding (BPE), it may split a word into segments whose meaning differs from that of the original word. So it is important to have human-in-the-loop quality control in many critical decision-making systems.
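
To see how such splits arise, here is a toy sketch of byte pair encoding on a tiny made-up corpus. It is a simplified illustration and not the tokenizer of any particular library: `learn_bpe`, `apply_merge`, and `segment` are hypothetical helpers, and real BPE tokenizers add details such as end-of-word markers and byte-level handling.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


merges = learn_bpe(["book", "book", "books"], num_merges=10)
print(segment("books", merges))
print(segment("ebooks", merges))
```

On this toy corpus, `books` ends up as a single token while the unseen word `ebooks` is split into `e` and `books`. With a different corpus the split could fall in a less meaningful place, which is the risk described above.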

In [38]:
word1="books"
word2="ebooks"

print("Similarity", similarity(model, word1, word2), word1, word2)

Similarity [[0.20786054]] books ebooks


Look at the two words above. They should be similar, at least to some extent, but here they are not. This can happen for many reasons, including the way these words are used in the dataset, the rarity of the word in question, or purely random noise.

So it is important to test for such issues rigorously.
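
One simple form such a test could take is a list of word pairs that we expect to be similar, checked against a minimum score. The sketch below is an assumption-laden illustration: `check_pairs` is a hypothetical helper, the threshold of 0.3 is arbitrary, and `toy_scores` holds rounded scores copied from the outputs above in place of a live model.

```python
def check_pairs(sim_fn, expected_similar, threshold):
    """Return the pairs whose score is missing or below `threshold`."""
    failures = []
    for w1, w2 in expected_similar:
        score = sim_fn(w1, w2)
        if score is None or score < threshold:
            failures.append((w1, w2, score))
    return failures


# Rounded scores taken from the outputs above; an OOV pair has no entry
toy_scores = {
    ("freedom", "liberty"): 0.365,
    ("books", "ebooks"): 0.208,
}


def toy_sim(w1, w2):
    # Returns None for pairs without a score, mimicking the OOV case
    return toy_scores.get((w1, w2))


expected_similar = [
    ("freedom", "liberty"),
    ("books", "ebooks"),
    ("cooperations", "rights"),
]
failures = check_pairs(toy_sim, expected_similar, threshold=0.3)
for w1, w2, score in failures:
    print("Needs review:", w1, w2, score)
```

To run this against the trained model, `sim_fn` would wrap the `similarity` function and unpack the 1x1 array it returns. The flagged pairs are then candidates for manual review.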

In [39]:
word1="pay"
word2="debt"

print("Similarity", similarity(model, word1, word2), word1, word2)

Similarity [[0.49995035]] pay debt
