<a href="https://colab.research.google.com/github/Kamiliaadil/Tokenization-Lab/blob/main/tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this lab, I start with the American Stories dataset to get a feel for how the tokenization code works. After that, I switch to another dataset from Hugging Face, `datatab/SrpWikiDataset`. This dataset contains text from Wikipedia articles in Serbian, totaling around 477,473 articles. It may also include some content from Wikisource.

In [None]:
!pip install datasets



In [None]:
import nltk
import timeit
from datasets import load_dataset

#American Stories Dataset

In [None]:
# Decide what year you want between 1810 and 1963
my_year = "1950"

# Decide how many articles you want to work with
num_articles = 10

#  Download data for your choice of year (1810 to 1963)
dataset = load_dataset("dell-research-harvard/AmericanStories",
    "subset_years",
    year_list=[my_year]
)

# Get the first n articles from that year
# instantiate the counter
i=0
# instantiate the string
my_articles = ''
# loop through each article for that year
for article in dataset[my_year]:
    #the article is a dictionary,
    #we're getting the text of the article by accessing the key, "article"
    my_articles += article.get('article')
    #add one to our counter
    i+=1
    #if the counter is greater than num_articles-1, stop looping
    if i>(num_articles-1): break
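
The counter loop above can also be written with `itertools.islice`, which stops after n items on its own. This is only a sketch: `first_n_texts` and `sample_records` are names I made up for illustration, and the sample rows stand in for the real dataset rows.

```python
from itertools import islice

def first_n_texts(records, n, key="article"):
    """Concatenate the text field of the first n records."""
    # islice stops after n items, so no manual counter is needed;
    # "or ''" guards against a record whose text field is missing
    return "".join(rec.get(key) or "" for rec in islice(records, n))

# Stand-in records shaped like the dataset rows (hypothetical content)
sample_records = [
    {"article": "First story. "},
    {"article": "Second story. "},
    {"article": "Third story."},
]
combined = first_n_texts(sample_records, 2)
print(combined)
```

With the real data, the call would presumably look like `first_n_texts(dataset[my_year], num_articles)`.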

In [None]:
#remove new line and other formatting characters
for char in ["\n", "\r", "\t"]:
    my_articles = my_articles.replace(char, " ")
my_articles[:1000]

"ONE HOLSTEIN MILK COW-Will have 3rd calf this month; one 6-yr.- old mare, broken to work with a q nr9la moving ea @sehae fsnsls IL seod soadon hGGSs M. GASSETT, LO. 3-S594. % GUERNSEY cow and 2-month-o -old heifer calf, for sale. Call TO. 5413. BERKSHIRE BROOD sows, some registered; 8 choice pigs; excellent soRREL ssSia tG RDN ISL tsdaS aSRsI sE xs zhss whit. hands tall; excellent condition; gentle snSo: SIS aa;hs' dE. 41ss. Ia '0' ia0' OF 'niio' BOARD AND BOX STALL for your horse. Potomac Hunt Club area. Rockville 364S.   EXCELLENT HUNTER - Handsome chestnut gelding, 1873 hands. can lady; success in horse shows; safe, willing jumper; smooth gaits for hacking. quiet, well hammered; rea- sonable price for this perfectly schooled horse. Tower 5044. PALoMINO STALLION, white mark ings, 8 years old, fine blood line, MR s1ea eRoxTa eeRisoeSxsTls9Sa gf s s6ss s5ssen to ride WORK MARE, 5-year -old, black short and stocky. Farm broken. 81. HEREFORD HEFERS, 10 mead, some bred, purebred bull. SH

#Whitespace Tokenization

In [None]:
%%time
#this is a magic function to determine how long a cell takes to run.
#It MUST be the first thing in a cell

#split the whole string on spaces. This returns a list
whitespace_tokens = my_articles.split(' ')

#check the list
whitespace_tokens[:20]

CPU times: user 152 µs, sys: 8 µs, total: 160 µs
Wall time: 164 µs


['ONE',
 'HOLSTEIN',
 'MILK',
 'COW-Will',
 'have',
 '3rd',
 'calf',
 'this',
 'month;',
 'one',
 '6-yr.-',
 'old',
 'mare,',
 'broken',
 'to',
 'work',
 'with',
 'a',
 'q',
 'nr9la']
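
One detail worth knowing about `split(' ')`: it splits on every single space, so a run of several spaces produces empty-string tokens. Calling `split()` with no argument splits on any run of whitespace instead. A small sketch using a snippet from the text above (the variable names are mine):

```python
# The sample has three spaces between "364S." and "EXCELLENT"
text = "Rockville 364S.   EXCELLENT HUNTER"

# Splitting on a single space keeps an empty string for each extra space
on_single_space = text.split(" ")

# Splitting with no argument treats any run of whitespace as one separator
on_any_whitespace = text.split()

print(on_single_space)
print(on_any_whitespace)
```

Because `split()` also handles tabs and newlines, it could replace the separate character-replacement step, assuming empty tokens are not wanted.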

# Morphological Tokenization

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
#This lemmatizer uses WordNet's built-in Morphy function
from nltk.stem import WordNetLemmatizer
wn_lemmatizer = WordNetLemmatizer()

In [None]:
%%time

#first we have to split the string on spaces to get "words"
whitespace_tokens = my_articles.split(' ')

my_lemmas = []
for word in whitespace_tokens:
    w = wn_lemmatizer.lemmatize(word)
    my_lemmas.append(w)
my_lemmas[:20]

CPU times: user 36.8 ms, sys: 63 µs, total: 36.9 ms
Wall time: 40.8 ms


['ONE',
 'HOLSTEIN',
 'MILK',
 'COW-Will',
 'have',
 '3rd',
 'calf',
 'this',
 'month;',
 'one',
 '6-yr.-',
 'old',
 'mare,',
 'broken',
 'to',
 'work',
 'with',
 'a',
 'q',
 'nr9la']
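
The lemmas above are identical to the whitespace tokens. A likely reason is that the tokens still carry punctuation and capital letters ('month;', 'mare,'), and that `lemmatize` treats every word as a noun unless a `pos` argument is given. The sketch below shows one way to normalize tokens first. It takes the lemmatizer as a function argument, and `TOY_LEMMAS` is a made-up lookup table standing in for WordNet so the example runs without any downloads.

```python
import string

def lemmatize_tokens(tokens, lemmatize):
    """Lowercase each token, strip surrounding punctuation, then lemmatize."""
    lemmas = []
    for token in tokens:
        core = token.strip(string.punctuation).lower()
        if core:  # skip tokens that were empty or only punctuation
            lemmas.append(lemmatize(core))
    return lemmas

# Hypothetical toy lookup standing in for a real lemmatizer
TOY_LEMMAS = {"sows": "sow", "pigs": "pig", "hands": "hand"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

demo = lemmatize_tokens(["SOWS,", "pigs;", "hands", "-"], toy_lemmatize)
print(demo)
```

In the notebook, `wn_lemmatizer.lemmatize` could be passed in place of `toy_lemmatize`.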

# Byte Pair Encoding

In [None]:
!pip install bpe
from bpe import Encoder



In [None]:
%%time
whitespace_tokens = my_articles.split(' ')

# create the BPE Encoder
# vocab_size=100; pct_bpe=0.95 reserves 95% of the vocabulary for byte-pair (subword) tokens
# and the remaining 5% for whole-word tokens; anything outside the vocabulary becomes UNK
encoder = Encoder(100, pct_bpe=0.95)
encoder.fit(whitespace_tokens)

CPU times: user 27.9 ms, sys: 986 µs, total: 28.9 ms
Wall time: 28.9 ms


In [None]:
#print(encoder.tokenize(my_articles))

print(next(encoder.inverse_transform(encoder.transform([my_articles]))))

one holstein milk cow - will have 3rd calf this month ; one 6 - yr .- old mare , broken to work with a __unk nr9la moving ea __unk sehae fsnsls il seod soadon hggss m . gassett , lo . 3 - s594 . __unk guernsey cow and 2 - month - o - old heifer calf , for sale . call to . 5413 . berkshire brood sows , some registered ; 8 choice pigs ; excellent sorrel sssia tg rdn isl tsdas asrsi se xs __unkhss whit . hands tall ; excellent condition ; gentle snso __unk sis aa ; hs ' de . 41ss . ia ' 0 ' ia0 ' of ' niio ' board and box stall for your horse . potomac hunt club area . rockville 364s . excellent hunter - handsome chestnut gelding , 18__unk3 hands . can lady ; success in horse shows ; safe , willing __unkumper ; smooth gaits for hacking . __unkuiet , well hammered ; rea - sonable price for this perfectly schooled horse . tower 5044 . palomino stallion , white mark ings , 8 years old , fine blood line , mr s1ea eroxta eerisoesxstls9sa gf s s6ss s5ssen to ride work mare , 5 - year - old , bl
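
To get a feel for what Byte Pair Encoding does under the hood, here is a toy version of the merge-learning step: count adjacent symbol pairs, fuse the most frequent pair into one symbol, and repeat. This is my own minimal sketch of the general idea, not the implementation inside the `bpe` package, and `learn_bpe_merges` is a name I chose.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn byte-pair merges from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from characters
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe_merges(["low", "low", "lower", "lowest"], 2)
print(merges)
print(dict(vocab))
```

With a small vocabulary budget, only the most frequent pairs get merged, which fits with rarer characters ending up as `__unk` in the output above.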

#Serbian Wikipedia Dataset (SrpWikiDataset)

In [None]:
# Decide how many articles you want to work with
num_articles = 10

# Load the "SrpWikiDataset"
dataset = load_dataset("datatab/SrpWikiDataset")

# Get the first n articles from the dataset
# Instantiate the counter
i = 0
# Instantiate the string
my_articles = ''
# Loop through each article in the dataset
for article in dataset['train']:
    # The article is a dictionary, get the text of the article
    my_articles += article['text']
    # Add one to the counter
    i += 1
    # If the counter is greater than num_articles-1, stop looping
    if i > (num_articles - 1):
        break

# Validate that it is what we expect by checking the first 1000 characters
print(my_articles[:1000])

Generating train split:   0%|          | 0/3796604 [00:00<?, ? examples/s]

UTF-8 UTF-8 varijanta je najzgodnija za kodiranje većinski latiničnog teksta.Dato je i kratko uputstvo za korišćenje te varijante u Microsoft Word-u, Netscape Composer-u i tekstualnom editoru Kate.U tekstu su takođe preporučeni standardni Unicode fontovi koji omogućavaju laku prenosivost teksta sa računara na računar ili za objavljivanje teksta na Internet.Prvi računari su bili pravljeni pretežno za englesko govorno područje i imali su podršku samo za engleski alfabet, za brojeve, zagrade i još po neki kontrolni karakter, što je činilo ukupno 128 mogućih slova (u 7 bita).To je bio tzv. -{ASCII}- ili -{US-ASCII}- standard.1968. godine je skup karaktera proširen na 256 (8 bita), a gornjih 128 karaktera je bilo korišćeno za dodatne karaktere.Iz neke navike je i ovaj prošireni -{ASCII}- nazivan -{ASCII}-, tako da tu često dolazi do zabune.Da bi postojala podrška za više jezika, smišljane su tzv. kodne strane (-{Code Page}-) koje definišu ponašanje tog dodatnog skupa slova.Osnovna kodna str

Since the displayed text is in Serbian, I use Google Translate (through the `googletrans` package) to convert it into English.

In [None]:
!pip install googletrans==4.0.0-rc1



In [None]:
from datasets import load_dataset
from googletrans import Translator

# Decide how many articles you want to work with
num_articles = 10

# Load the "SrpWikiDataset"
dataset = load_dataset("datatab/SrpWikiDataset")

# Create a translator
translator = Translator()

# Get the first n articles from the dataset and translate to English
# Instantiate the counter
i = 0
# Instantiate the string
my_articles = ''
# Loop through each article in the dataset
for article in dataset['train']:
    # The article is a dictionary, get the text of the article
    article_text = article['text']

    # Translate the article text from Serbian to English
    translated_text = translator.translate(article_text, src='sr', dest='en').text

    my_articles += translated_text
    # Add one to the counter
    i += 1
    # If the counter is greater than num_articles-1, stop looping
    if i > (num_articles - 1):
        break

# Validate that it is what we expect by checking the first 1000 characters
print(my_articles[:1000])

The UTF-8 UTF-8 variant is the most common for coding the majority latin text.The short manual for use of this variant in Microsoft Word, Netscape Composer and the Kate's text editor is also given.The text is also recommended by standard unicode fonts that allow easy portability of text from your computer to a computer or to publish text to the Internet.The first computers were predominantly for the English speech area and had support only for English alphabet, for numbers, brackets and some control character, which made a total of 128 possible letters (in 7 bits).It was the so-called.- {ASCII} - or - {US-ASCII} - Standard.In 1968, the character set expanded to 256 (8 bits), and the upper 128 characters was used for additional characters.From some habit, this extended - {ascii} - called - {ascii} -, so that often comes to confusion.In order to support for several languages, the so-called are designed.Code sides (- {code page} -) that define the behavior of that additional set of letter

In [None]:
#remove new line and other formatting characters
for char in ["\n", "\r", "\t"]:
    my_articles = my_articles.replace(char, " ")
my_articles[:1000]

"The UTF-8 UTF-8 variant is the most common for coding the majority latin text.The short manual for use of this variant in Microsoft Word, Netscape Composer and the Kate's text editor is also given.The text is also recommended by standard unicode fonts that allow easy portability of text from your computer to a computer or to publish text to the Internet.The first computers were predominantly for the English speech area and had support only for English alphabet, for numbers, brackets and some control character, which made a total of 128 possible letters (in 7 bits).It was the so-called.- {ASCII} - or - {US-ASCII} - Standard.In 1968, the character set expanded to 256 (8 bits), and the upper 128 characters was used for additional characters.From some habit, this extended - {ascii} - called - {ascii} -, so that often comes to confusion.In order to support for several languages, the so-called are designed.Code sides (- {code page} -) that define the behavior of that additional set of lette

##Whitespace Tokenization

In [None]:
%%time
#this is a magic function to determine how long a cell takes to run.
#It MUST be the first thing in a cell

#split the whole string on spaces. This returns a list
whitespace_tokens = my_articles.split(' ')

#check the list
whitespace_tokens[:20]

CPU times: user 66 µs, sys: 0 ns, total: 66 µs
Wall time: 70.1 µs


['The',
 'UTF-8',
 'UTF-8',
 'variant',
 'is',
 'the',
 'most',
 'common',
 'for',
 'coding',
 'the',
 'majority',
 'latin',
 'text.The',
 'short',
 'manual',
 'for',
 'use',
 'of',
 'this']
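
The token 'text.The' shows that sentences in this text are glued together without a space after the period, so whitespace splitting cannot separate them. One rough fix is to insert a space wherever a period sits directly between a lowercase and an uppercase letter. This is a heuristic sketch for the translated English text only (`split_glued_sentences` is my own helper name), and it would also split things like camel-cased names containing a period.

```python
import re

def split_glued_sentences(text):
    """Insert a space after a period that sits between a lowercase and an uppercase letter."""
    # Lookbehind/lookahead match the position without consuming the letters
    return re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)

glued = "the majority latin text.The short manual is also given.The text"
fixed = split_glued_sentences(glued)
print(fixed.split(" "))
```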

##Morphological Tokenization

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
#This lemmatizer uses WordNet's built-in Morphy function
from nltk.stem import WordNetLemmatizer
wn_lemmatizer = WordNetLemmatizer()

In [None]:
%%time

#first we have to split the string on spaces to get "words"
whitespace_tokens = my_articles.split(' ')

my_lemmas = []
for word in whitespace_tokens:
    w = wn_lemmatizer.lemmatize(word)
    my_lemmas.append(w)
my_lemmas[:20]

CPU times: user 1.5 ms, sys: 0 ns, total: 1.5 ms
Wall time: 1.52 ms


['The',
 'UTF-8',
 'UTF-8',
 'variant',
 'is',
 'the',
 'most',
 'common',
 'for',
 'coding',
 'the',
 'majority',
 'latin',
 'text.The',
 'short',
 'manual',
 'for',
 'use',
 'of',
 'this']

##Byte Pair Encoding

In [None]:
%%time
whitespace_tokens = my_articles.split(' ')

# create the BPE Encoder
# vocab_size=100; pct_bpe=0.95 reserves 95% of the vocabulary for byte-pair (subword) tokens
# and the remaining 5% for whole-word tokens; anything outside the vocabulary becomes UNK
encoder = Encoder(100, pct_bpe=0.95)
encoder.fit(whitespace_tokens)

CPU times: user 2.92 ms, sys: 0 ns, total: 2.92 ms
Wall time: 3.1 ms


In [None]:
#print(encoder.tokenize(my_articles))

print(next(encoder.inverse_transform(encoder.transform([my_articles]))))

the utf - 8 utf - 8 variant is the most common for coding the ma__unkority latin text . the short manual for use of this variant in microsoft word , netscape composer and the __unkate __unk s text editor is also given . the text is also recommended by standard unicode fonts that allow easy portability of text from your computer to a computer or to publish text to the internet . the first computers were predominantly for the english speech area and had support only for english alphabet , for numbers , brac__unkets and some control character , which made a total of __unk__unk8 possible letters __unk in __unk bits __unk__unk it was the so - called __unk- { ascii } - or - { us - ascii } - standard . in __unk__unk__unk8 , the character set expanded to __unk__unk__unk __unk 8 bits __unk, and the upper __unk__unk8 characters was used for additional characters . from some habit , this extended - { ascii } - called - { ascii } -, so that often comes to confusion . in order to support for severa
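
The decoded text above contains many `__unk` markers. To put a number on that, the sketch below counts what share of whitespace-separated pieces contain the marker. `unk_rate` is a helper I wrote for illustration, and the sample string is copied from the output above.

```python
def unk_rate(decoded_text, unk_marker="__unk"):
    """Return the share of whitespace-separated pieces that contain the UNK marker."""
    pieces = decoded_text.split()
    if not pieces:
        return 0.0
    with_unk = sum(1 for piece in pieces if unk_marker in piece)
    return with_unk / len(pieces)

sample = "for coding the ma__unkority latin text . the __unkate __unk s text editor"
rate = unk_rate(sample)
print(f"{rate:.1%} of pieces contain an unknown marker")
```

Running this on the full decoded string for different vocabulary sizes would show how much a larger vocabulary reduces the unknowns.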

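The results show that the three methods did not all give the same output. Whitespace tokenization and WordNet lemmatization returned identical first 20 tokens on both datasets, most likely because the tokens still carry punctuation and capital letters (for example 'month;' and 'text.The'), which leaves the lemmatizer little to match. Byte Pair Encoding behaved differently: its output is lowercased, punctuation is split off, and characters it could not encode with a 100-token vocabulary appear as `__unk` (for example 'ma__unkority').
As for the time it took to process the text, all methods were quick. On the translated Wikipedia text, whitespace splitting took about 70 µs, lemmatization about 1.5 ms, and fitting the BPE encoder about 3 ms. On American Stories the wall times were about 164 µs, 41 ms, and 29 ms, so lemmatization rather than BPE was the slowest step there. The choice of tokenization method therefore doesn't slow things down much for a sample of this size.
In summary, whitespace tokenization is the quickest method on both datasets.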
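
A single `%%time` reading is noisy at the microsecond scale, so a fairer comparison repeats each measurement. The sketch below uses `timeit.repeat` and keeps the minimum. The two functions and the sample text are stand-ins I made up so the example runs without NLTK or `bpe`; the real tokenizers could be dropped in the same way.

```python
import timeit

# Repeated sample line so each call does a measurable amount of work
text = "ONE HOLSTEIN MILK COW-Will have 3rd calf this month; " * 200

def whitespace_tokenize(s):
    return s.split(" ")

def per_token_tokenize(s):
    # Stand-in for a heavier per-token step such as lemmatization
    return [tok.lower() for tok in s.split(" ")]

results = {}
for name, fn in [("whitespace", whitespace_tokenize), ("per-token work", per_token_tokenize)]:
    # 5 repeats of 100 calls each; the minimum is the least disturbed run
    times = timeit.repeat(lambda: fn(text), number=100, repeat=5)
    results[name] = min(times) / 100
    print(f"{name}: {results[name] * 1e6:.1f} µs per call")
```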