# Unsupervised Subword Tokenizers vs. Morphology

Let's explore how unsupervised tokenizers, commonly used in deep learning, relate to linguistic morphology. Your task is to tweak the code to see whether subword tokenization could serve as a proxy for real morphological analysis.




## Things you may need to do before running the code

### Install NLTK and Tokenizers packages:

```
pip install tokenizers
pip install nltk
```

### Download the Brown Corpus from NLTK


```
import nltk
nltk.download('brown')
```
 

In [None]:
# !pip install tokenizers
# !pip install nltk

In [None]:
import nltk
from nltk.corpus import brown

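# Write one sentence per line. The `with` block closes (and flushes) the file
# before the tokenizers read it further down.
with open("brown-corpus.txt", "w", encoding="utf-8") as corpus_f:
    # count = 0
    # vocab = set()
    for s in brown.sents():
        corpus_f.write(" ".join(s) + "\n")
        # count += len(s)      # s is already a list of word strings
        # vocab.update(s)

# print("No. of words:", count)
# print("No. of unique words:", len(vocab))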

# Tokenizers

In [None]:
from tokenizers import Tokenizer

from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Whitespace

from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

In [None]:
VOCAB_SIZE = 100   # Experiment with this value: try several sizes and compare the segmentations

# Byte-Pair Encoding (BPE) tokenization

In [None]:
BPE_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

trainer = BpeTrainer(vocab_size=VOCAB_SIZE, 
                     special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

BPE_tokenizer.pre_tokenizer = Whitespace()    # Optional: pre-split the text into words and punctuation

files = ["brown-corpus.txt"]

BPE_tokenizer.train(files, trainer)

BPE_tokenizer.save("BPE-tokenizer.json")
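
To make the effect of `VOCAB_SIZE` easier to reason about, here is a minimal pure-Python sketch of the BPE merge-learning idea. It is an illustration only, not the implementation inside the `tokenizers` library: the function names `learn_bpe_merges`, `merge_pair` and `segment` are hypothetical, there is no end-of-word marker, and ties between equally frequent pairs are broken arbitrarily but deterministically. Each merge adds one new symbol to the vocabulary, so a small vocabulary leaves most words split into very short pieces.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs over the whole (weighted) word list.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are resolved by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_words = ["assist", "assisted", "assisting", "assistant"]
toy_merges = learn_bpe_merges(toy_words, 5)
print(toy_merges)
print(segment("assistants", toy_merges))
```

Increasing `num_merges` in this toy plays the same role as raising `VOCAB_SIZE` above: more merges give longer pieces.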

# WordPiece tokenization

In [None]:
WP_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

WP_trainer = WordPieceTrainer(vocab_size=VOCAB_SIZE,
                              special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

WP_tokenizer.pre_tokenizer = Whitespace()    # Optional: pre-split the text into words and punctuation

files = ["brown-corpus.txt"]

WP_tokenizer.train(files, WP_trainer)

WP_tokenizer.save("WP-tokenizer.json")

# Unigram tokenization

In [None]:
UG_tokenizer = Tokenizer(Unigram())

UG_trainer = UnigramTrainer(vocab_size=VOCAB_SIZE,
                            unk_token="[UNK]",    # Same unknown token as listed in special_tokens
                            special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

UG_tokenizer.pre_tokenizer = Whitespace()    # Optional: pre-split the text into words and punctuation

files = ["brown-corpus.txt"]

UG_tokenizer.train(files, UG_trainer)

UG_tokenizer.save("UG-tokenizer.json")

# Let's compare the tokenizers

Your task here will be to use a small evaluation corpus to test how the different algorithms perform against one another, while varying the size of the vocabulary above.

Feel free to add other words to see how they are segmented (but you need to provide a gold segmentation for each one for the comparison to work).

In [None]:
# Some data extracted from https://github.com/sigmorphon/2022SegmentationST
test_corpus = [
    ["assistant", ["assist","ant"]],
    ["assistants", ["assist","ant","s"]],
    ["assist", ["assist"]],
    ["assisted",["assist","ed"]],
    ["assisting", ["assist","ing"]],
    ["assistance",["assist", "ance"]],
    ["assistive", ["assist","ive"]],
    ["assistful", ["assist","ful"]],
    ["assister", ["assist","er"]],
    ["unassisted", ["un","assist","ed"]],
    ["coassistance", ["co","assist","ance"]],
    ["coassists", ["co","assist","s"]],
    ["overassisting",["over","assist","ing"]],
    ["entaming", ["en", "tame", "ing"]],
    ["hoarders", ["hoard", "er", "s"]],
    ["visitorship", ["visit","or","ship"]],
    ["reorganises", ["re","organise","s"]],
    ["wargamer", ["war","game","er"]],               
    ["encodability", ["en","code","ability"]],
    ["healthy", ["health","y"]],
    ["buildings", ["build","ing","s"]],
    ["socioeconomy", ["socio","economy"]],
]    
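
One thing worth checking before comparing: the evaluation further down counts exact matches, and a subword tokenizer normally outputs pieces that concatenate back to the input word. Some gold entries above are canonical segmentations that restore dropped letters (for example `tame` in `entaming`), so they cannot be matched exactly by such pieces. The sketch below flags those entries; `is_surface_segmentation` is a hypothetical helper name, and the sample list is copied from `test_corpus` so that the block runs on its own.

```python
def is_surface_segmentation(word, morphs):
    """True if the gold morphs concatenate back to the word exactly."""
    return "".join(morphs) == word


# A few entries copied from test_corpus above.
sample = [
    ("assistants", ["assist", "ant", "s"]),
    ("entaming", ["en", "tame", "ing"]),
    ("wargamer", ["war", "game", "er"]),
    ("encodability", ["en", "code", "ability"]),
    ("buildings", ["build", "ing", "s"]),
]

# Entries whose gold segmentation is canonical rather than surface-level.
canonical_only = [w for w, m in sample if not is_surface_segmentation(w, m)]
print(canonical_only)  # ['entaming', 'wargamer', 'encodability']
```

When reading the scores, it may help to keep in mind that these three words count against every tokenizer regardless of vocabulary size.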

In [None]:
for instance in test_corpus:
    print(instance)

In [None]:
count_wp, count_bpe, count_ug = 0, 0, 0

report = ""
for word, morphs in test_corpus:
    # No decoder is attached to these tokenizers, so decode() joins the tokens
    # with spaces and split() recovers the list of pieces.
    # WordPiece marks word-internal pieces with "##", which is stripped here.
    wp = WP_tokenizer.decode(WP_tokenizer.encode(word).ids).replace("#", "").split()
    bpe = BPE_tokenizer.decode(BPE_tokenizer.encode(word).ids).split()
    ug = UG_tokenizer.decode(UG_tokenizer.encode(word).ids).split()

    # Count only exact matches with the gold segmentation.
    if wp == morphs:
        count_wp += 1
    if bpe == morphs:
        count_bpe += 1
    if ug == morphs:
        count_ug += 1

    # Reuse the segmentations computed above instead of encoding again.
    report += "GOLD: " + " ".join(morphs) + "\n"
    report += "Wordpiece: " + " ".join(wp) + "\n"
    report += "BPE: " + " ".join(bpe) + "\n"
    report += "Unigram: " + " ".join(ug) + "\n"
    report += "------------------------------------------\n"


print("\n")
print("------------------------------------------")
print("RESULTS:")
print("------------------------------------------")
print("Wordpiece:", count_wp)
print("BPE:", count_bpe)
print("Unigram:", count_ug)
print("------------------------------------------")
print("\n\n")
print(report)
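
Exact match is a strict criterion: a segmentation with one extra split scores the same as a completely wrong one. As a possible complement, the sketch below scores segment boundaries with precision, recall and F1. It is an assumption-laden illustration, not part of the original exercise: `boundaries` and `boundary_prf` are hypothetical helper names, the convention of returning 1.0 when a side has no boundaries is a choice made here, and the measure is only meaningful when both segmentations concatenate to the same word.

```python
def boundaries(segments):
    """Character offsets at which a new segment starts (offset 0 excluded)."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_prf(predicted, gold):
    """Boundary precision, recall and F1 of `predicted` against `gold`."""
    pred_cuts, gold_cuts = boundaries(predicted), boundaries(gold)
    hits = len(pred_cuts & gold_cuts)
    # Convention assumed here: no predicted (or gold) boundaries counts as 1.0.
    precision = hits / len(pred_cuts) if pred_cuts else 1.0
    recall = hits / len(gold_cuts) if gold_cuts else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One spurious split ("ass|ist") but the gold boundary before "ant" is found.
p, r, f = boundary_prf(["ass", "ist", "ant"], ["assist", "ant"])
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.50 R=1.00 F1=0.67
```

Inside the comparison loop, the same function could be called with `wp`, `bpe` or `ug` as `predicted` and `morphs` as `gold`, averaging the scores over the test corpus.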
