## Docs

### Frequency Tokenizer

In [2]:
import tokenizers as tk

Read, preprocess then train

In [2]:
tokenizer = tk.FrequencyTokenizer()
tokenizer.process_data('samples/data.txt')
tokenizer.train()

Reading the data ...
Splitting the data ...


Tokenize 

In [3]:
tokenizer.tokenize("السلام عليكم")

['السلام', 'عليكم']

Encode as ids

In [4]:
tokenizer.encode("السلام عليكم")

[536, 829]

Decode back to tokens

In [5]:
tokenizer.decode([536, 829])

['السلام', 'عليكم']

### SentencePiece Tokenizer

Read, preprocess then train

In [2]:
tokenizer = tk.SentencePieceTokenizer()
tokenizer.process_data('samples/data.txt')
tokenizer.train()

Reading the data ...
Splitting the data ...


Tokenize 

In [3]:
tokenizer.tokenize("صباح الخير يا أصدقاء")

['▁صباح', '▁الخير', '▁يا', '▁أص', 'د', 'قاء']

Encode as ids

In [4]:
tokenizer.encode("صباح الخير يا أصدقاء")

[3777, 1424, 78, 423, 9962, 560]

Decode back to tokens

In [5]:
tokenizer.decode([3777, 1424, 78, 423, 9962, 560])

['▁صباح', '▁الخير', '▁يا', '▁أص', 'د', 'قاء']

### Auto Tokenizer

Read and preprocess (the pretrained vocabulary is loaded, so there is no training step)

In [10]:
tokenizer = tk.AutoTokenizer()
tokenizer.process_data('samples/data.txt')

loading vocab ...
Reading the data ...
Splitting the data ...


Tokenize 

In [11]:
tokenizer.tokenize("السلام عليكم")

['ال', '##سلام', 'علي', '##كم']

Encode as ids

In [12]:
tokenizer.encode("السلام عليكم")

[1, 3834, 8716, 4957]

Decode back to tokens

In [13]:
tokenizer.decode([1, 3834, 8716, 4957])

['ال', '##سلام', 'علي', '##كم']

### Random Tokenizer

In [1]:
import tokenizers as tk
tokenizer = tk.RandomTokenizer()
tokenizer.process_data('samples/data.txt')
tokenizer.train()

Reading the data ...
Splitting the data ...
Training ...


In [7]:
tokenizer.tokenize("السلام عليكم أيها الأصدقاء")

['ال', 'سلام##', 'علي', 'كم##', 'أي', 'ها##', 'ال', 'أصد##', 'قاء##']

### Large Files

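For large files, token frequencies can be extracted with memory mapping. Passing `large_file = True` to `train` uses `mmap` to process the data in chunks, one chunk per iteration step. On the small sample file used below, both modes take about the same time (roughly 0.37 seconds).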

In [1]:
import time
import tokenizers as tk

In [9]:
# initialize
tokenizer = tk.FrequencyTokenizer()
tokenizer.process_data('samples/data.txt')

# calculating time with memory mapping
start_time = time.time()
tokenizer.train(large_file = True)
end_time = time.time()
time_with_mmap = end_time - start_time

# calculating time without memory mapping
start_time = time.time()
tokenizer.train(large_file = False)
end_time = time.time()
time_without_mmap = end_time - start_time

0it [00:00, ?it/s]

Reading the data ...
Splitting the data ...


1it [00:00,  4.39it/s]


In [8]:
print('Time with memory mapping ', time_with_mmap)
print('Time without memory mapping ', time_without_mmap)

Time with memory mapping  0.3706831932067871
Time without memory mapping  0.36846137046813965
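
The chunked, memory-mapped counting described above can be sketched in plain Python. This is only an illustration of the general technique, not the library's implementation: the function name `count_tokens_mmap` and the `chunk_size` parameter are hypothetical, and the sketch assumes UTF-8 text whose tokens are separated by ASCII whitespace.

```python
import mmap
import os
from collections import Counter


def count_tokens_mmap(path, chunk_size=1 << 20):
    """Count whitespace-separated tokens by scanning a memory-mapped file in chunks.

    Hypothetical helper for illustration; assumes UTF-8 text and ASCII whitespace.
    """
    counts = Counter()
    if os.path.getsize(path) == 0:
        return counts  # an empty file cannot be memory-mapped

    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        leftover = b""  # partial token carried over from the previous chunk
        for start in range(0, len(mm), chunk_size):
            chunk = leftover + mm[start:start + chunk_size]
            pieces = chunk.split()  # splits on ASCII whitespace only
            if not chunk[-1:].isspace():
                # The chunk ends mid-token: hold the last piece back until
                # the rest of it arrives with the next chunk.
                leftover = pieces.pop()
            else:
                leftover = b""
            # Only complete tokens are decoded, so a chunk boundary that falls
            # inside a multi-byte UTF-8 character is harmless.
            counts.update(piece.decode("utf-8") for piece in pieces)
        if leftover:
            counts[leftover.decode("utf-8")] += 1
    return counts
```

Calling `count_tokens_mmap('samples/data.txt')` would return a `Counter` of token frequencies while slicing at most `chunk_size` bytes out of the mapping at a time, plus any carried-over partial token.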


### Tokenization vs Segmentation 

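Tokenization with a pretrained dictionary can also be used to segment words. This is much faster than segmenting with a library such as `farasa`: in the runs below, the pretrained tokenizer finishes in about 4 seconds, while preprocessing with Farasa segmentation takes about 44 seconds.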

In [2]:
tokenizer = tk.AutoTokenizer()
start_time = time.time()
tokenizer.process_data('samples/data.txt')
out = tokenizer.tokenize(open('data/raw/train.txt').read())
end_time = time.time()
print(end_time - start_time)

loading vocab ...
Reading the data ...
Splitting the data ...
4.019087553024292


In [3]:
tokenizer = tk.FrequencyTokenizer(segment = True)
start_time = time.time()
tokenizer.process_data('samples/data.txt')
end_time = time.time()
print(end_time - start_time)

Initializing Farasa




Reading the data ...
Segmenting the data ...
Splitting the data ...
44.405993700027466
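
To illustrate why dictionary-based segmentation is cheap, here is a minimal sketch of greedy longest-match-first segmentation against a fixed vocabulary. The `##` continuation prefix mirrors the Auto Tokenizer output shown earlier, but whether `AutoTokenizer` works this way internally is an assumption. The names `segment_word`, `segment_text`, and `toy_vocab` are hypothetical.

```python
def segment_word(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matches: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


def segment_text(text, vocab):
    """Segment every whitespace-separated word in the text."""
    return [piece for word in text.split() for piece in segment_word(word, vocab)]


# Toy vocabulary, made up for illustration only.
toy_vocab = {"ال", "##سلام", "علي", "##كم"}
print(segment_text("السلام عليكم", toy_vocab))
```

Each word needs only a handful of set lookups, with no morphological analysis, which is consistent with the timing gap measured above.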


### Export Models

Models can be saved for deployment and reloading.

In [3]:
tokenizer = tk.FrequencyTokenizer()
tokenizer.process_data('samples/data.txt')
tokenizer.train()
tokenizer.save_model('freq.pl')

Reading the data ...
Splitting the data ...
Saving as pickle file ...


Load a saved model without retraining

In [4]:
tokenizer = tk.FrequencyTokenizer()
tokenizer.process_data('samples/data.txt')
tokenizer.load_model('freq.pl')

Reading the data ...
Splitting the data ...
Loading as pickle file ...
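
The log messages above indicate that models are stored as pickle files. As a rough sketch of that idea, a vocabulary can be round-tripped with the standard `pickle` module. The helpers `save_vocab` and `load_vocab` are hypothetical and are not the library's `save_model` / `load_model`; the dictionary layout is also an assumption.

```python
import pickle


def save_vocab(vocab, path):
    """Serialize a token-to-id mapping to disk (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(vocab, f)


def load_vocab(path):
    """Load a token-to-id mapping previously written by save_vocab."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Ids taken from the Frequency Tokenizer example earlier in this document.
vocab = {"السلام": 536, "عليكم": 829}
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from trusted sources.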
