Runs in the NEMO_GPU and torch_gpu environments.
Reference: https://colab.research.google.com/github/NVIDIA/NeMo/blob/r1.0.0rc1/tutorials/asr/08_ASR_with_Subword_Tokenization.ipynb
https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt
To install:
!pip install tokenizers



The tokenizer-building script used in the NeMo tutorial referenced above takes a few important arguments:

    either --manifest or --data_file: If your text data lies inside an ASR manifest file, use --manifest. If instead the text data is in a plain text file with one sentence per line, use --data_file. In either case, you can pass several manifests or data files by separating their paths with commas.

    --data_root: The output directory (whose subdirectories will be created if not present) where the tokenizers will be placed.

    --vocab_size: The size of the tokenizer vocabulary. Larger vocabularies can accommodate almost entire words, but the decoder size of any model will grow proportionally.

    --tokenizer: Can be either spe or wpe. spe refers to the Google sentencepiece library tokenizer. wpe refers to the HuggingFace BERT Word Piece tokenizer. Please refer to the papers cited in the reference tutorial for the relevant technique in order to select an appropriate tokenizer.

    --no_lower_case: When this flag is passed, it will force the tokenizer to create separate tokens for upper and lower case characters. By default, the script will turn all the text to lower case before tokenization (and if upper case characters are passed during training/inference, the tokenizer will emit a token equivalent to Out-Of-Vocabulary). Used primarily for the English language.

    --spe_type: The sentencepiece library has a few implementations of the tokenization technique, and spe_type refers to these implementations. Currently supported types are unigram, bpe, char, word. Defaults to bpe.

    --spe_character_coverage: The sentencepiece library considers how much of the original vocabulary it should cover in its "base set" of tokens (akin to the lower and upper case characters of the English language). For almost all languages with small base token sets (<1000 tokens), this should be kept at its default of 1.0. For languages with larger vocabularies (say Japanese, Mandarin, Korean etc), the suggested value is 0.9995.

    --spe_sample_size: If the dataset is too large, consider using a sampled dataset indicated by a positive integer. By default, any negative value (default = -1) will use the entire dataset.

    --spe_train_extremely_large_corpus: When training a sentencepiece tokenizer on very large amounts of text, the tokenizer will sometimes run out of memory or be unable to hold that much data in RAM. At some point you might receive the following error: "Input corpus too large, try with train_extremely_large_corpus=true". If your machine has a large amount of RAM, it might still be possible to build the tokenizer using this flag. It will fail silently if it runs out of RAM.

    --log: Whether the script should display log messages.
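
The notebook below trains from a plain text file (the --data_file case). If the text instead sits in an ASR manifest, it has to be extracted first. The following is a minimal sketch of that step, not the NeMo script itself. It assumes the manifest is a JSON-lines file in which every entry has a "text" field; the function name `manifest_to_text` and the demo file names are hypothetical.

```python
import json
import os
import tempfile


def manifest_to_text(manifest_path, text_path, lower_case=True):
    """Write the "text" field of each manifest entry to text_path, one per line.

    Assumption: the manifest holds one JSON object per line with a "text" key.
    """
    count = 0
    with open(manifest_path, "r", encoding="utf-8") as src, \
            open(text_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            text = json.loads(line)["text"]
            if lower_case:
                text = text.lower()
            dst.write(text + "\n")
            count += 1
    return count


# Tiny demonstration with a made-up two-entry manifest in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_manifest = os.path.join(demo_dir, "manifest.json")
demo_text = os.path.join(demo_dir, "text.txt")
with open(demo_manifest, "w", encoding="utf-8") as f:
    f.write(json.dumps({"audio_filepath": "a.wav", "duration": 1.0, "text": "Chapter One"}) + "\n")
    f.write(json.dumps({"audio_filepath": "b.wav", "duration": 2.0, "text": "missus rachel lynde"}) + "\n")

num_lines = manifest_to_text(demo_manifest, demo_text)
with open(demo_text, encoding="utf-8") as f:
    demo_lines = f.read().splitlines()
print(num_lines, demo_lines)
```

The `lower_case` switch mirrors the default behaviour described for --no_lower_case: text is lowercased unless told otherwise.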



In [1]:
import tokenizers
import os

In [2]:
vocab_size = 1000  # maximum vocabulary size; any value can be specified
tokenizer_type = "wpe"
dst_folder = "/media/rathna/New Volume/word_piece"
text_path = "/media/rathna/New Volume/word_piece/libri_text.txt"

tokenizer_dir = os.path.join(dst_folder, 'tokenizer_{}_v{}'.format(tokenizer_type, vocab_size))

# Create the output directory if it does not already exist
os.makedirs(tokenizer_dir, exist_ok=True)

# lowercase=False keeps the original casing, so upper and lower case characters get separate tokens;
# lowercase=True would lowercase all text before tokenization
tokenizer = tokenizers.BertWordPieceTokenizer(lowercase=False)

tokenizer.train(text_path, vocab_size=vocab_size)
tokenizer.save_model(tokenizer_dir)






['/media/rathna/New Volume/word_piece/tokenizer_wpe_v1000/vocab.txt']

In [10]:
# Read the saved vocabulary file; the context manager closes it automatically
with open("/media/rathna/New Volume/word_piece/tokenizer_wpe_v1000/vocab.txt", "r") as my_file:
    data = my_file.read()

# Split the text into one entry per line.
# A trailing newline in the file leaves an empty string as the last entry.
vocab = data.split("\n")
print(vocab)

['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '##i', '##l', '##a', '##r', '##e', '##d', '##m', '##n', '##o', '##u', '##s', '##p', '##t', '##z', '##c', '##h', '##b', '##k', '##x', '##v', '##g', '##y', '##w', '##f', '##j', '##q', 'th', 'the', '##er', '##nd', '##in', '##ed', '##ou', '##at', '##en', 'and', '##es', 'to', '##or', 'of', '##on', '##is', '##ing', '##ar', '##as', '##an', '##it', '##ll', 'in', '##re', 'wh', '##om', 'he', 'ha', 'be', '##le', '##ic', '##ot', '##ow', 'was', '##ut', 'it', '##ld', 'that', '##ly', 'sh', '##gh', '##se', '##id', 'on', '##ve', '##ent', '##et', '##im', 'you', '##ion', '##ir', '##ce', '##st', 'as', '##ith', 'for', 'his', '##ay', '##al', '##ur', '##ter', 'with', 'st', '##ch', '##ver', 'her', 're', 'had', '##ad', '##ght', 'an', 'not', '##am', '##her', 'at', 'is', '##ess', '##oo', '##ould', 'but', '##ct', 'fr', 'she', 'se', 'we', '

In [4]:
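def encode_word(word):
    # Greedy longest-match-first WordPiece encoding against the global `vocab`
    tokens = []
    while len(word) > 0:
        # Find the longest prefix of the remaining word that is in the vocabulary
        i = len(word)
        while i > 0 and word[:i] not in vocab:
            i -= 1
        if i == 0:
            # No prefix matched, so the whole word maps to the unknown token
            return ["[UNK]"]
        tokens.append(word[:i])
        word = word[i:]
        if len(word) > 0:
            # Mark the remainder as a continuation piece
            word = f"##{word}"
    return tokens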

In [5]:
print(encode_word("closed"))
print(encode_word("opened"))

['cl', '##osed']
['open', '##ed']
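
To make the greedy longest-match behaviour easier to follow, here is a self-contained sketch that runs the same loop against a tiny hand-made vocabulary. The function name `encode_word_greedy` and the toy vocabulary are illustrative assumptions, not the trained vocabulary from this notebook. A set is used for the vocabulary so each membership check is a constant-time lookup.

```python
def encode_word_greedy(word, vocab_set, unk_token="[UNK]"):
    """Split one word into word pieces by repeatedly taking the longest known prefix."""
    tokens = []
    while word:
        end = len(word)
        # Shrink the candidate prefix until it is found in the vocabulary
        while end > 0 and word[:end] not in vocab_set:
            end -= 1
        if end == 0:
            # Nothing matched: the whole word becomes the unknown token
            return [unk_token]
        tokens.append(word[:end])
        word = word[end:]
        if word:
            # Remaining characters continue the word, so prefix them with '##'
            word = "##" + word
    return tokens


# Hypothetical toy vocabulary, chosen only for this illustration
toy_pieces = {"open", "cl", "##ed", "##osed", "##s"}

print(encode_word_greedy("opened", toy_pieces))
print(encode_word_greedy("closed", toy_pieces))
print(encode_word_greedy("xyz", toy_pieces))
```

With this toy vocabulary the first two calls reproduce the splits shown above, and the last call falls back to the unknown token because no prefix of "xyz" is in the vocabulary.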


In [6]:
def tokenize(text):
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    print(pre_tokenize_result)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    print(pre_tokenized_text)
    encoded_words = [encode_word(word) for word in pre_tokenized_text]
    return sum(encoded_words, [])

In [7]:
text = "chapter one missus rachel lynde is surprised missus rachel lynde lived just where the avonlea main road dipped down into a little hollow fringed with alders and ladies eardrops and traversed by a brook"
print(tokenize(text))

[('chapter', (0, 7)), ('one', (8, 11)), ('missus', (12, 18)), ('rachel', (19, 25)), ('lynde', (26, 31)), ('is', (32, 34)), ('surprised', (35, 44)), ('missus', (45, 51)), ('rachel', (52, 58)), ('lynde', (59, 64)), ('lived', (65, 70)), ('just', (71, 75)), ('where', (76, 81)), ('the', (82, 85)), ('avonlea', (86, 93)), ('main', (94, 98)), ('road', (99, 103)), ('dipped', (104, 110)), ('down', (111, 115)), ('into', (116, 120)), ('a', (121, 122)), ('little', (123, 129)), ('hollow', (130, 136)), ('fringed', (137, 144)), ('with', (145, 149)), ('alders', (150, 156)), ('and', (157, 160)), ('ladies', (161, 167)), ('eardrops', (168, 176)), ('and', (177, 180)), ('traversed', (181, 190)), ('by', (191, 193)), ('a', (194, 195)), ('brook', (196, 201))]
['chapter', 'one', 'missus', 'rachel', 'lynde', 'is', 'surprised', 'missus', 'rachel', 'lynde', 'lived', 'just', 'where', 'the', 'avonlea', 'main', 'road', 'dipped', 'down', 'into', 'a', 'little', 'hollow', 'fringed', 'with', 'alders', 'and', 'ladies', 'e

In [11]:
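# Drop the five special tokens at the start ([PAD], [UNK], [CLS], [SEP], [MASK])
# and the empty string at the end left by the trailing newline
vocab = vocab[5:-1]
print(len(vocab))
vocab = vocab + [' ']  # add a space token; the apostrophe is to be added later
print(len(vocab))
print(vocab)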

995
996
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '##i', '##l', '##a', '##r', '##e', '##d', '##m', '##n', '##o', '##u', '##s', '##p', '##t', '##z', '##c', '##h', '##b', '##k', '##x', '##v', '##g', '##y', '##w', '##f', '##j', '##q', 'th', 'the', '##er', '##nd', '##in', '##ed', '##ou', '##at', '##en', 'and', '##es', 'to', '##or', 'of', '##on', '##is', '##ing', '##ar', '##as', '##an', '##it', '##ll', 'in', '##re', 'wh', '##om', 'he', 'ha', 'be', '##le', '##ic', '##ot', '##ow', 'was', '##ut', 'it', '##ld', 'that', '##ly', 'sh', '##gh', '##se', '##id', 'on', '##ve', '##ent', '##et', '##im', 'you', '##ion', '##ir', '##ce', '##st', 'as', '##ith', 'for', 'his', '##ay', '##al', '##ur', '##ter', 'with', 'st', '##ch', '##ver', 'her', 're', 'had', '##ad', '##ght', 'an', 'not', '##am', '##her', 'at', 'is', '##ess', '##oo', '##ould', 'but', '##ct', 'fr', 'she', 'se', 'we', 'pr', 'sa', '##ere', 'him', 'so', '##il

In [12]:
vocab = sorted(vocab)
num_embeddings = len(vocab)
print(num_embeddings)
# Assign each token its index in the sorted vocabulary
vocab_ids = list(range(num_embeddings))
vocab_dict = dict(zip(vocab, vocab_ids))
print(f"Character mappings : {vocab_dict}")

996
Character mappings : {' ': 0, '##a': 1, '##ab': 2, '##able': 3, '##ac': 4, '##ace': 5, '##ach': 6, '##ack': 7, '##act': 8, '##ad': 9, '##ade': 10, '##ady': 11, '##ag': 12, '##age': 13, '##ail': 14, '##ain': 15, '##ained': 16, '##air': 17, '##ak': 18, '##ake': 19, '##aken': 20, '##aking': 21, '##al': 22, '##ale': 23, '##alk': 24, '##all': 25, '##ally': 26, '##als': 27, '##am': 28, '##ame': 29, '##amp': 30, '##an': 31, '##ance': 32, '##and': 33, '##ang': 34, '##ange': 35, '##ank': 36, '##ans': 37, '##ant': 38, '##ants': 39, '##ap': 40, '##aps': 41, '##ar': 42, '##ard': 43, '##are': 44, '##ared': 45, '##ark': 46, '##ars': 47, '##art': 48, '##ary': 49, '##as': 50, '##ase': 51, '##ash': 52, '##ason': 53, '##ass': 54, '##ast': 55, '##at': 56, '##atch': 57, '##ate': 58, '##ated': 59, '##ately': 60, '##ater': 61, '##ates': 62, '##ath': 63, '##ather': 64, '##ating': 65, '##ation': 66, '##ations': 67, '##atter': 68, '##au': 69, '##augh': 70, '##ause': 71, '##aut': 72, '##av': 73, '##ave': 74

In [15]:
import re

In [27]:
def new_tokenize(text):
    # Split on single spaces and re-insert ' ' between words so that
    # spaces are kept as explicit tokens
    text_list = re.split(" ", text)
    new_list = []
    for i in range(0, len(text_list) - 1):
        new_list.append(text_list[i])
        new_list.append(' ')
    new_list.append(text_list[-1])
    # Encode each word (and each space) into word pieces
    encoded_words = []
    for word in new_list:
        encoded_words += encode_word(word)
    print(encoded_words)
    # Map each word piece to its integer ID
    mapped_words = [vocab_dict[word_piece] for word_piece in encoded_words]
    return mapped_words

In [28]:
print(new_tokenize(text))

['ch', '##ap', '##ter', ' ', 'one', ' ', 'missus', ' ', 'r', '##ach', '##el', ' ', 'l', '##y', '##nd', '##e', ' ', 'is', ' ', 'sur', '##p', '##ri', '##se', '##d', ' ', 'missus', ' ', 'r', '##ach', '##el', ' ', 'l', '##y', '##nd', '##e', ' ', 'li', '##ved', ' ', 'just', ' ', 'where', ' ', 'the', ' ', 'a', '##v', '##on', '##le', '##a', ' ', 'ma', '##in', ' ', 'ro', '##ad', ' ', 'd', '##ip', '##pe', '##d', ' ', 'down', ' ', 'into', ' ', 'a', ' ', 'little', ' ', 'ho', '##llow', ' ', 'fr', '##ing', '##ed', ' ', 'with', ' ', 'al', '##der', '##s', ' ', 'and', ' ', 'la', '##d', '##ies', ' ', 'ear', '##d', '##ro', '##ps', ' ', 'and', ' ', 'tra', '##ver', '##se', '##d', ' ', 'by', ' ', 'a', ' ', 'br', '##ook']
[500, 40, 360, 0, 764, 0, 724, 0, 812, 6, 105, 0, 682, 413, 252, 100, 0, 667, 0, 887, 318, 341, 348, 96, 0, 724, 0, 812, 6, 105, 0, 682, 413, 252, 100, 0, 693, 399, 0, 673, 0, 961, 0, 899, 0, 417, 397, 271, 236, 1, 0, 705, 187, 0, 829, 9, 0, 525, 200, 319, 96, 0, 542, 0, 666, 0, 417, 0, 69
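
One thing to watch: cell In [11] removes the special tokens from the vocabulary, while encode_word still returns "[UNK]" for a word it cannot split. Looking that piece up in vocab_dict would then raise a KeyError, for example on text containing an apostrophe. The sketch below shows one possible way around this, assuming "[UNK]" is kept in the vocabulary. The helper `pieces_to_ids` and the toy vocabulary are hypothetical and not part of the notebook.

```python
def pieces_to_ids(pieces, token_to_id, unk_token="[UNK]"):
    """Map word pieces to IDs, sending anything unknown to the ID of unk_token."""
    unk_id = token_to_id[unk_token]
    return [token_to_id.get(piece, unk_id) for piece in pieces]


# Hypothetical toy vocabulary that keeps the unknown token
toy_tokens = sorted(["[UNK]", " ", "a", "br", "##ook", "by"])
toy_token_to_id = {token: idx for idx, token in enumerate(toy_tokens)}

print(toy_token_to_id)
print(pieces_to_ids(["by", " ", "a", " ", "br", "##ook"], toy_token_to_id))
print(pieces_to_ids(["[UNK]"], toy_token_to_id))
```

Whether an unknown-token ID is wanted at all depends on the downstream model, so treat this as one option rather than a required change.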

In [30]:
sample = ['ch', '##ap', '##ter', ' ', 'one', ' ', 'missus', ' ', 'r', '##ach', '##el', ' ', 'l', '##y', '##nd', '##e', ' ', 'is', ' ', 'sur', '##p', '##ri', '##se', '##d', ' ', 'missus', ' ', 'r', '##ach', '##el', ' ', 'l', '##y', '##nd', '##e', ' ', 'li', '##ved', ' ', 'just', ' ', 'where', ' ', 'the', ' ', 'a', '##v', '##on', '##le', '##a', ' ', 'ma', '##in', ' ', 'ro', '##ad', ' ', 'd', '##ip', '##pe', '##d', ' ', 'down', ' ', 'into', ' ', 'a', ' ', 'little', ' ', 'ho', '##llow', ' ', 'fr', '##ing', '##ed', ' ', 'with', ' ', 'al', '##der', '##s', ' ', 'and', ' ', 'la', '##d', '##ies', ' ', 'ear', '##d', '##ro', '##ps', ' ', 'and', ' ', 'tra', '##ver', '##se', '##d', ' ', 'by', ' ', 'a', ' ', 'br', '##ook']
sample_text = ''.join(sample)  # concatenate the word pieces into one string
print(sample_text)

ch##ap##ter one missus r##ach##el l##y##nd##e is sur##p##ri##se##d missus r##ach##el l##y##nd##e li##ved just where the a##v##on##le##a ma##in ro##ad d##ip##pe##d down into a little ho##llow fr##ing##ed with al##der##s and la##d##ies ear##d##ro##ps and tra##ver##se##d by a br##ook


In [31]:
# Remove only the '##' continuation marker so that the pieces merge back into words
my_new_string = sample_text.replace("##", "")
print(my_new_string)

chapter one missus rachel lynde is surprised missus rachel lynde lived just where the avonlea main road dipped down into a little hollow fringed with alders and ladies eardrops and traversed by a brook
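
The last two cells decode from a list of word pieces. Going all the way back from integer IDs needs the inverse of the token-to-ID mapping. Here is a minimal sketch of that round trip with a toy vocabulary; the names `decode_ids` and `toy_id_to_token` are hypothetical. The '##' prefix is stripped per piece rather than with a global string replace, so only the continuation marker at the start of a piece is removed.

```python
def decode_ids(ids, id_to_token):
    """Turn a list of token IDs back into text by merging word pieces."""
    parts = []
    for token_id in ids:
        piece = id_to_token[token_id]
        # A leading '##' marks a continuation piece; drop the marker and append directly
        if piece.startswith("##"):
            piece = piece[2:]
        parts.append(piece)
    return "".join(parts)


# Hypothetical toy vocabulary with an explicit space token, sorted as in the notebook
toy_sorted_vocab = sorted([" ", "a", "br", "##ook", "by"])
toy_id_to_token = dict(enumerate(toy_sorted_vocab))

decoded = decode_ids([4, 0, 2, 0, 3, 1], toy_id_to_token)
print(decoded)
```

Because spaces are explicit tokens in this scheme, no extra logic is needed to decide where word boundaries fall.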
