# Custom Tokenizer

## Introduction

In this notebook we will develop a custom tokenizer, train it and store it.


In order to do this, this notebook consists of the following sections:
- Data preparation: loading all required data in the proper form
- EDA: exploration of the data [of dit niet meer?]
- Creating a dictionary with the morphological segmentations of Dutch words
- Creating the tokenizer using this dictionary
- Evaluating the tokenizer and comparing it to other tokenizers

## Data preparation

We have two main sources of data:
- OSCAR: a corpus with a lot of Dutch text data
- CELEX: a database with information about Dutch words

### OSCAR 


There are a couple of ways to work with the OSCAR corpus. The main choice we have to make is whether we want to download the corpus to our personal computer first, or download the dataset (in segments, as the dataset is too large to lead into memory at once) from the Hugging Face library when we want to use it for a certain task. 

Downloading the corpus manually is done in 45 segments. This means we could load one of these segments into memory at once, but it is easier to make a generator that behaves in the same way as the one that is necessary for streaming the dataset directly from Hugging Face, so that our functions can handle both. 

We will also create a small dataset in that can be used for testing some functions.

In [2]:
# from datasets import load_dataset, DatasetDict
# import os


# # function that returns a dictionary with a generator for every existing OSCAR file in this computer
# def create_local_oscar_generators(data_path, i=0, j=0):

#     out = {}
    
#     if j > i:
#         n = j - i

#         for x in range(i, j+1):
#             full_path = os.path.join(data_path, 'OSCAR', f'nl_part_{x}.txt')
#             if os.path.isfile(full_path):
#                 out[f'oscar{i}'] = create_text_generator(load_dataset('text', data_files={"train": full_path}, split='train', streaming=True))
        
#         if len(out) != n + 1:
#             print('Not all parts requested are on this computer')
    
#     else:

#         for i in range(1, 50):
#             full_path = os.path.join(data_path, 'OSCAR', f'nl_part_{i}.txt')
#             if os.path.isfile(full_path):
#                 out[f'oscar{i}'] = create_text_generator(load_dataset('text', data_files={"train": full_path}, split='train', streaming=True))


#     return out


# # function that creates one generator out of multiple generators
# def create_super_generator(generator_dict, list_input=False):

#     if list_input:
#         for generator in generator_dict:
#             yield from generator
#     else:
#         for generator in generator_dict.values():
#             yield from generator


# # one function to create OSCAR generator by combining n parts of the dataset, from part i to part j
# def create_super_local_oscar_generator(data_path, i=0, j=0):
    
#     if j > i:
#         generators = create_local_oscar_generators(data_path, i=i, j=j)
#     else:
#         generators = create_local_oscar_generators(data_path)

#     return create_super_generator(generators)


# # function to create a dataset with text 
# def create_test_set(dataset_generator, start, end):
#     it = iter(dataset_generator)
#     for _ in range(start):
#         next(it)
#     for _ in range(end - start + 1):
#         yield next(it)


# # function to turn a generator that returns a dictionary with 'text' as key into a generator of the values
# def create_text_generator(gen):
#     for i in gen:
#         yield i['text']




In [3]:
# # set path to datasets
# data_path = '/Users/jan/Documents/Master/Thesis/Code/Datasets'

# # download from huggingface
# dataset_from_hub = load_dataset('oscar', 'unshuffled_deduplicated_nl', split='train', streaming=True, trust_remote_code=True)

# # create from local files
# oscar1 = os.path.join(data_path, 'OSCAR', 'nl_part_1.txt')
# data_files = {"train": oscar1}
# oscar_it_dict = load_dataset('text', data_files=data_files, split='train', streaming=True)

# # create dictionary with a generator for every part
# gen_dict = create_local_oscar_generators(data_path)

# # create from local files
# oscar_gen_1 = gen_dict['oscar1']
# oscar_gen_2 = gen_dict['oscar2']

# # create super generator from all OSCAR files on computer
# oscar_gen_super = create_super_local_oscar_generator(data_path)

# # create small dataset (uneven number of lines)
# oscar_gen_small = create_test_set(oscar_gen_1, 0, 100007)

### CELEX

The CELEX database consists of more than 10 datasets, all focused on different features of language. For out purpuses, we use two of these datasets:
- one with morphological segmentations
- one with information that we can use to create groups of related words

In [1]:
# set path to datasets
data_path = '/Users/jan/Documents/Master/Thesis/Code/Datasets'

In [2]:
celex = os.path.join(data_path, 'CELEX-2-NL', 'DUTCH', 'DML', 'DML.CD')
celex2 = os.path.join(data_path, 'CELEX-2-NL', 'DUTCH', 'DFW', 'DFW.CD')

### SimLex

In [3]:
def load_simlex(simlex999, scores=False):
    # create list with a tuple for every word pair in the form of (word_1, word_2, similarity score, POS-tag)
    word_pairs = []

    # create a set with all words
    words_set = set([])

    with open(simlex999) as simlex:
        
        next(simlex) # skip first line
        
        for line in simlex:
    
            split = line.strip().split('\t')
            word_pairs.append(tuple(split))
            words_set.add(split[0])
            words_set.add(split[1])

    # create a list of unique words
    simlex_words = list(words_set)

    if scores:
        return word_pairs
    else:
        return simlex_words
        

In [4]:
simlex_path = os.path.join(data_path, 'SimLex-999', 'SimLex-999-Dutch-final.txt')
simlex_words = load_simlex(simlex_path)
simlex_pairs = load_simlex(simlex_path, scores=True)

## Morfessor

This is an existing module that is based on several (statistical) methods to morphilogically segment a word. A model is trained with a list of words. We will first do this for English and then for Dutch. 

In [None]:
# import nltk
# from nltk.corpus import words


# # using nltk word corpus as training data
# words = words.words()
# outfile = open("words", "w")
# for word in words:
#     outfile.write(word+"\n")

# outfile.close()

In [None]:
# import math
# import morfessor

# # function for adjusting the counts of each compound
# def log_func(x):
#     return int(round(math.log(x + 1, 2)))

# infile = "words"
# io = morfessor.MorfessorIO()
# train_data = list(io.read_corpus_file(infile))
# model = morfessor.BaselineModel()
# model.load_data(train_data, count_modifier=log_func)
# model.train_batch()
# io.write_binary_model_file("model.bin", model)

...........................................................
...........................................................
...........................................................
...........................................................
...........................................................
...........................................................


In [None]:

# model_file = "model.bin"
# io = morfessor.MorfessorIO()
# model = io.read_binary_model_file(model_file)

# word = 'untestably'
# # for segmenting new words we use the viterbi_segment(compound) method
# print(model.viterbi_segment(word)[0])

['un', 'test', 'ably']


#### Dutch version

Let's train a Dutch model now. For this we only need a list of Dutch words. I have used this one: https://github.com/OpenTaal/opentaal-wordlist


It contains over 400.000 Dutch words

In [None]:
# def log_func(x):
#     return int(round(math.log(x + 1, 2)))

# infile = "wordlist.txt"
# io = morfessor.MorfessorIO()
# train_data = list(io.read_corpus_file(infile))
# model_nl = morfessor.BaselineModel()
# model_nl.load_data(train_data, count_modifier=log_func)
# model_nl.train_batch()
# io.write_binary_model_file("model_nl.bin", model_nl)

...........................................................
...........................................................
...........................................................
...........................................................
...........................................................


In [None]:
# model_file = "model_nl.bin"
# io = morfessor.MorfessorIO()
# model_nl = io.read_binary_model_file(model_file)

# word = 'huisarrest'
# # for segmenting new words we use the viterbi_segment(compound) method
# print(model_nl.viterbi_segment(word)[0])

['huis', 'arrest']


Let's say we want to use this model in our tokenization algorithm. Let's first see how long it takes to tokenize all 400.000 words with this model. 

In [None]:
# words_nl = []

# with open('wordlist.txt') as file:
#     for i, line in enumerate(file):
#         words_nl.append(line.strip())


In [None]:
# segmented_words = {}
# for word in words_nl:
#     segmented_words[word] = model_nl.viterbi_segment(word)[0]

Luckily this is pretty fast, which means no problems will arise when we use it in our tokenization algorithm.






In [None]:
# class CustomTokenizerMorfessor:

#     def __init__(self):
#         self.vocab = {'UNK': 0}
#         self.n = 0
#         self.max_vocab_size = 50000
#         self.model = morfessor.MorfessorIO().read_binary_model_file("model_nl.bin") # this could of course be done differently, for instance by passing the model as argument, but it's fine for now
    
#     def get_vocab(self):
#         return self.vocab # note: we should probably use a getter here, but for now this is ok
    
#     def normalize(self, seq):
#         return seq.lower() 
    
#     def pre_tokenize(self, seq):
#         return seq.split()
    
#     def create_vocab(self, tokens):
#         for token in tokens:
#             if token not in self.vocab and len(self.get_vocab()) < self.max_vocab_size:
#                 self.n += 1
#                 self.vocab[token] = self.n
    
#     def encode(self, seq):
#         seq = self.pre_tokenize(self.normalize(seq))
#         seq = [self.model.viterbi_segment(word)[0] for word in seq]
#         seq = [item for sublist in seq for item in (sublist if isinstance(sublist, list) else [sublist])]
#         return [self.vocab[token] if token in self.vocab else self.vocab['UNK'] for token in seq]
    
#     def decode(self, ids: list[int]):
#         assert type(ids) == list
#         assert type(ids[0]) == int   # dit kan wel netter, volgens mij kan het al met alleen type hints
#         inverted_vocab = {value: key for key, value in self.vocab.items()}  # met een getter zou je dit niet elke keer opnieuw hoeven doen. Maar let altijd op of als je de een update je de ander ook update
#         out = ''
#         for idx in ids:
#             out += inverted_vocab[idx] + ' '
#         return out
    
#     def tokenize(self, seq):
#         inverted_vocab = {value: key for key, value in self.vocab.items()}  # met een getter zou je dit niet elke keer opnieuw hoeven doen. Maar let altijd op of als je de een update je de ander ook update
#         return [inverted_vocab[idx] if idx in inverted_vocab else inverted_vocab[0] for idx in self.encode(seq)]

#     def __call__(self, seq):
#         ids = self.encode(seq)
#         types = [0 for token in ids]
#         attention = [1 for token in ids]
#         return {'input_ids': ids, 'token_type_ids': types, 'attention_mask': attention}

There are multiple ways to use this morphological segmentation model in our tokenizer. Let's first see in how many unique parts the 400.000 words are split up.

In [None]:
# parts = set([])

# for i in segmented_words.values():
#     for j in i:
#         parts.add(j)

# parts = list(parts)

In [None]:
# print(f'The {len(segmented_words)} words in the database are split up into {len(parts)} unique units')

The 413937 words in the database are split up into 46463 unique units


As we can see, the more than 400.000 words can be represented by less than 47.000 tokens. This is actually a fairly common vocabulary size, so the first thing we can do is build a tokenizer with these 47.000 tokens as the vocabulary. 

## EDA (CELEX)

Let's find some more things about our dataset. In the 9th column we have the morphemes of a word. There are four options here:
1. the entry is empty (because the word cannot be segmented and the word itself is not a morpheme)
2. the entry has one morpheme that is identical with the word
3. the entry has one morpheme that is not identical with the word
4. the entry contains multiple morphemes, that when concatenated are identical to the word
5. the entry contains multiple morphemes, that when concatenated are not identical to the word

Let's see how often these things occur. We will first do this for the initial segmentations, so without an extra segmentation of parts:


In [1891]:


# def initial_stats_celex(celex, print_info=True):

#     n1 = {}
#     n2 = {}
#     n3 = {}
#     n4 = {}
#     n5 = {}
#     doubles = {}


#     with open(celex) as cd:
#         for i, line in enumerate(cd):
            
#             line = line.strip().split('\\')
            
#             word = line[1]
#             seg = line[8]
#             morph = line[12]
            
#             if word in n1 or word in n2 or word in n3 or word in n4 or word in n5:
#                 doubles[word] = i+1

#             if len(seg) == 0:
#                 n1[word] = i+1
            
#             else:
#                 if not '+' in seg:
#                     if word == seg:
#                         n2[word] = i+1
#                     else:
#                         n3[word] = (i+1, seg)
            
#                 else:
#                     split = seg.split('+')
#                     concat = ''.join(split)

#                     if concat == word:
#                         n4[word] = i+1
#                     else:
#                         n5[word] = (i+1, seg)

#     j = len(n1) + len(n2) + len(n3) + len(n4) + len(n5)

#     print('---- Numbers for initial segmentation ----')
#     print(f'Option 1 occurs {len(n1)} times ({round(100*len(n1)/j, 1)}%)')
#     print(f'Option 2 occurs {len(n2)} times ({round(100*len(n2)/j, 1)}%)')
#     print(f'Option 3 occurs {len(n3)} times ({round(100*len(n3)/j, 1)}%)')
#     print(f'Option 4 occurs {len(n4)} times ({round(100*len(n4)/j, 1)}%)')
#     print(f'Option 5 occurs {len(n5)} times ({round(100*len(n5)/j, 1)}%)')



#     # do the same after an extra loop
    
# def stats_after_loop(segmentations):

#     n1 = {}
#     n2 = {}
#     n3 = {}
#     n4 = {}
#     n5 = {}

#     for word, seg in segmentations.items():
#         if len(seg) == 0:
#             n1[word] = seg
#         if len(seg) == 1:
#             if word == seg[0]:
#                 n2[word] = seg
#             else:
#                 n3[word] = seg
#         if len(seg) > 1:
#             if word == ''.join(seg):
#                 n4[word] = seg
#             else:
#                 n5[word] = seg 


#     j = len(n1) + len(n2) + len(n3) + len(n4) + len(n5)

#     print('---- Numbers after extra segmentation step ----')
#     print(f'Option 1 occurs {len(n1)} times ({round(100*len(n1)/j, 1)}%)')
#     print(f'Option 2 occurs {len(n2)} times ({round(100*len(n2)/j, 1)}%)')
#     print(f'Option 3 occurs {len(n3)} times ({round(100*len(n3)/j, 1)}%)')
#     print(f'Option 4 occurs {len(n4)} times ({round(100*len(n4)/j, 1)}%)')
#     print(f'Option 5 occurs {len(n5)} times ({round(100*len(n5)/j, 1)}%)')



We can observe that there are words that occur more than once in the dataset. Let's see how many times:

In [10]:
# def count_duplicates(celex):

#     dub = {}
#     duplicates = {}

#     with open(celex) as cd:
#         for i, line in enumerate(cd):
            
#             line = line.strip().split('\\')
            
#             word = line[1]
#             seg = line[8]
#             morph = line[12]

#             if not word in dub:
#                 dub[word] = seg
            
#             else:
#                 if not dub[word] == seg:
#                     duplicates[word] = (i, dub[word], seg)


#     print(f'Duplicates occur {len(duplicates)} times ({round(len(duplicates)/i, 4)}%)')


Duplicates occur 539 times (0.0043%)


## Creating a segmentation dictionary from CELEX

### Functions to create dictionaries

In [82]:
import re
import copy
import json

def load_json(path):

    with open(path, 'r') as f:
        my_dict = json.load(f)
    return my_dict

def store_json(path, object):
    with open(path, 'w') as f:
        json.dump(object, f)




def extract_substrings(input_string):
    
    # Regular expression to match sequences of letters
    pattern = re.compile(r'([a-zA-Z]+)')
    
    # Find all matches
    matches = pattern.findall(input_string)
    
    return [part for part in matches if not part in ['N', 'V', 'P', 'A', 'PA', 'PV']]



def create_initial_dataframe(celex):
    
    new_dict2 = {}

    with open(celex) as cd:
        for line in cd:

            
            line = line.strip().split('\\')

            
            if line[12] == '':
                cat = line[-1]
            else:
                cat = line[12][-2]

            word = line[1]
            seg = line[8]
            morph = line[12]
            parts = extract_substrings(morph)

            new_dict2[word] = {'cat': cat, 'segments1': seg, 'segments2': parts, 'info': morph}

    return new_dict2


def create_segmentations_from_base(base, only_same_spelling=False):

    new_dict_updated2 = {}

    for word, dic in base.items():
        
        seg = dic['segments1']
        split1 = seg.split('+')
        split2 = dic['segments2']

        concat1 = ''.join(split1)
        concat2 = ''.join(split2)

        if only_same_spelling:
            if word == concat2:
                new_dict_updated2[word] = split2
            else:
                if word == concat1:
                        new_dict_updated2[word] = split1
                else:
                    if len(base[word]['segments1']) == 0 and len(base[word]['segments2']) == 0:
                        new_dict_updated2[word] = []
        else:
            if word == concat2:
                new_dict_updated2[word] = split2
            else:
                if word == concat1:
                        new_dict_updated2[word] = split1
                else:
                    if len(base[word]['segments1']) == 0 and len(base[word]['segments2']) == 0:
                        new_dict_updated2[word] = []
                    else:
                        if len(split1) > len(split2):
                            new_dict_updated2[word] = split1
                        else:
                            new_dict_updated2[word] = split2
        
    return new_dict_updated2


def add_basic_verbs(df, base):

    new_dict_updated2 = copy.deepcopy(df)

    for word, dic in base.items():
        
        seg = dic['segments1']
        split1 = seg.split('+')
        split2 = dic['segments2']

        concat1 = ''.join(split1)
        concat2 = ''.join(split2)

        if dic['cat'] == 'V' and concat2 + 'en' == word:
            split2.append('en')
            new_dict_updated2[word] = split2

        else:
            if dic['cat'] == 'V' and concat1 + 'en' == word:
                split1.append('en')
                new_dict_updated2[word] = split1
  
    
    return new_dict_updated2



def create_segmentations_extra_loop(df):
    
    segmentations_new = {}

    n1 = 0

    for word, segments in df.items():
        seg = []
        for unit in segments:
            if not unit in df:
                seg.append(unit)
            else:
                if len(df[unit]) == 0:
                    seg.append(unit)
                elif len(df[unit]) == 1:   # note: we must choose whether we want to replace words that have a single morpheme that is not identical with the word
                    #seg.append(unit)      # we do this in another function now, so I don't do it here
                    seg.append(df[unit][0])
                else:
                    seg += df[unit] 
    
        segmentations_new[word] = seg
    
    return segmentations_new


# this function adds all the morphemes in a dictionary to the dictionary with the morpheme as key and as value
def add_morphemes_to_dict(d):

    dic = d
    n = 0

    morfs = set([])

    for word, segs in dic.items():
        for seg in segs:
            morfs.add(seg)

    for morf in morfs:
        if not morf in dic or len(dic[morf]) == 0:
            n += 1
            dic[morf] = [morf]

    return dic



# this function adds the word as segmentation of itself for all words that have an empty list as segmentation
def add_empty_segmentations(df):

    out = {}

    for word, seg in df.items():
        if len(seg) == 0:
            out[word] = [word]
        else:
            out[word] = seg

    return out


# this function replaces the single morphemes that are not identical with the word with the word
def replace_non_identical_morphs(df):
     


    out = {}

    for word, seg in df.items():
        if len(seg) == 1 and not word == seg[0]:
            out[word] = [word]
        else:
            out[word] = seg
    
    return out




# create dictionary with related words for every word
def create_word_fams(celex2):

    word_fams = {}

    with open(celex2) as cd:
        for line in cd:
            line = line.strip().split('\\')
            word = line[1]
            fam = line[2]
            word_fams[word] = fam
    
    return word_fams


# function to make 'inverse' dict by value
def group_keys_by_value(input_dict):

    value_to_keys = {}
    for key, value in input_dict.items():
        if value not in value_to_keys:
            value_to_keys[value] = []
        value_to_keys[value].append(key)
  
    output_dict = {key: [k for k in value_to_keys[input_dict[key]] if k != key] for key in input_dict}
    
    return output_dict


# function to create extra segmentation dataframe for a suffix
def create_extra_dataframe(df, word_fams, suffix):

    word_groups = group_keys_by_value(word_fams)
    
    related_words = {word: rels for word, rels in word_groups.items() if word in initial_dataframe}

    plus = {}
    segmentations_extra = {}

    for word in segmentations:
        for rel in related_words[word]:
            if not rel in segmentations and word + suffix == rel:
                plus[word] = rel

    for word, mult in plus.items():
        segmentations_extra[mult] = segmentations[word] + [suffix]
    
    return segmentations_extra



def remove_ortho_changes(df):


    return {word: seg for word, seg in df.items() if ''.join(seg) == word}


# function to add conjugations of verbs
# the non_words parameter is for the greedy / non-greedy approach
def create_verb_segmentations(base, groups, non_words=False):

    extra_segmentations = {}

    if non_words:

        for word, dic in base.items():
            
            seg = dic['segments1']
            split1 = seg.split('+')
            split2 = dic['segments2']

            concat1 = ''.join(split1)
            concat2 = ''.join(split2)


            if dic['cat'] == 'V' and concat2 + 'en' == word:  # waarom gebeurt dit nooit?


                extra_segmentations[concat2 + 'de'] = split2 + ['de']
                extra_segmentations[concat2 + 'den'] = split2 + ['den']
                extra_segmentations[concat2 + 'end'] = split2 + ['end']
                extra_segmentations[concat2 + 'ende'] = split2 + ['end', 'e']
                extra_segmentations[concat2 + 't'] = split2 + ['t']
                extra_segmentations['ge' + concat2 + 'd'] = ['ge'] + split2 + ['d']
                extra_segmentations['ge' + concat2 + 't'] = ['ge'] + split2 + ['t']

                extra_segmentations[concat2 + 'er'] = split2 + ['er']
                extra_segmentations[concat2 + 'eur'] = split2 + ['eur']
                extra_segmentations[concat2 + 'ster'] = split2 + ['ster']
                extra_segmentations[concat2 + 'euse'] = split2 + ['euse']

            
            else:
                
                if dic['cat'] == 'V' and concat1 + 'en' == word:


                    extra_segmentations[concat1 + 'de'] = split1 + ['de']
                    extra_segmentations[concat1 + 'den'] = split1 + ['den']
                    extra_segmentations[concat1 + 'end'] = split1 + ['end']
                    extra_segmentations[concat1 + 'ende'] = split1 + ['end', 'e']
                    extra_segmentations[concat1 + 't'] = split1 + ['t']
                    extra_segmentations['ge' + concat1 + 'd'] = ['ge'] + split1 + ['d']
                    extra_segmentations['ge' + concat1 + 't'] = ['ge'] + split1 + ['t']

                    extra_segmentations[concat1 + 'er'] = split2 + ['er']
                    extra_segmentations[concat1 + 'eur'] = split2 + ['eur']
                    extra_segmentations[concat1 + 'ster'] = split2 + ['ster']
                    extra_segmentations[concat1 + 'euse'] = split2 + ['euse']

                    # nog toevoegen: werkwoorden als wegfietsen -> weg-ge-fiets-t
                    # hoe herken je deze? niet-greedy is het wel te doen denk ik


    else:

        for word, dic in base.items():
            
            seg = dic['segments1']
            split1 = seg.split('+')
            split2 = dic['segments2']

            concat1 = ''.join(split1)
            concat2 = ''.join(split2)


            if dic['cat'] == 'V' and concat2 + 'en' == word:  # waarom gebeurt dit nooit?

                if concat2 + 'de' in groups[word] :
                    extra_segmentations[concat2 + 'de'] = split2 + ['de']
                if concat2 + 'den' in groups[word] :
                    extra_segmentations[concat2 + 'den'] = split2 + ['den']
                if concat2 + 'end' in groups[word]:
                    extra_segmentations[concat2 + 'end'] = split2 + ['end']
                if concat2 + 'ende' in groups[word]:
                    extra_segmentations[concat2 + 'ende'] = split2 + ['end', 'e']
                if concat2 + 't' in groups[word]:
                    extra_segmentations[concat2 + 't'] = split2 + ['t']
                if 'ge' + concat2 + 'd' in groups[word]:
                    extra_segmentations['ge' + concat2 + 'd'] = ['ge'] + split2 + ['d']
                if 'ge' + concat2 + 't' in groups[word]:
                    extra_segmentations['ge' + concat2 + 't'] = ['ge'] + split2 + ['t']
                if  concat2 + 'er' in groups[word]:
                    extra_segmentations[concat2 + 'er'] = split2 + ['er']
                if  concat2 + 'eur' in groups[word]:
                    extra_segmentations[concat2 + 'eur'] = split2 + ['eur']
                if  concat2 + 'ster' in groups[word]:
                    extra_segmentations[concat2 + 'ster'] = split2 + ['ster']
                if  concat2 + 'euse' in groups[word]:
                        extra_segmentations[concat2 + 'euse'] = split2 + ['euse']


            
            else:
                
                if dic['cat'] == 'V' and concat1 + 'en' == word:

                    if concat1 + 'de' in groups[word]:
                        extra_segmentations[concat1 + 'de'] = split1 + ['de']
                    if concat1 + 'den' in groups[word]:
                        extra_segmentations[concat1 + 'den'] = split1 + ['den']
                    if concat1 + 'end' in groups[word]:
                        extra_segmentations[concat1 + 'end'] = split1 + ['end']
                    if concat1 + 'ende' in groups[word]:
                        extra_segmentations[concat1 + 'ende'] = split1 + ['end', 'e']
                    if concat1 + 't' in groups[word]:
                        extra_segmentations[concat1 + 't'] = split1 + ['t']
                    if 'ge' + concat1 + 'd' in groups[word]:
                        extra_segmentations['ge' + concat1 + 'd'] = ['ge'] + split1 + ['d']
                    if 'ge' + concat1 + 't' in groups[word]:
                        extra_segmentations['ge' + concat1 + 't'] = ['ge'] + split1 + ['t']
                    if  concat2 + 'er' in groups[word]:
                        extra_segmentations[concat1 + 'er'] = split2 + ['er']
                    if  concat2 + 'eur' in groups[word]:
                        extra_segmentations[concat1 + 'eur'] = split2 + ['eur']
                    if  concat2 + 'ster' in groups[word]:
                        extra_segmentations[concat1 + 'ster'] = split2 + ['ster']
                    if  concat2 + 'euse' in groups[word]:
                            extra_segmentations[concat1 + 'euse'] = split2 + ['euse']
        
    return extra_segmentations


def add_plurals_(dic):

    out = {}

    morphemes = set([])
    for word, segs in dic.items():
        for seg in segs:
            morphemes.add(seg)
    
    for word, segs in dic.items():
        out[word] = segs
    
    for word, segs in dic.items():
        concat = ''.join(segs)
        if word == concat + 'en':
            out[word] = segs + ['en']
        if word == concat + 's':
            out[word] = segs + ['s']
        if word == concat + 'je':
            out[word] = segs + ['je']
        if word == concat + 'jes':
            out[word] = segs + ['je', 's']
        if word == concat + 'tje':
            out[word] = segs + ['tje']
        if word == concat + 'tjes':
            out[word] = segs + ['tje', 's']
    
    return out


def create_n_segmentations(df, n):

    return {word: segments for word, segments in df.items() if len(segments) >= n}




def create_noun_segmentations(df, groups, word_freqs, non_words=False, n_min=2):


    prefixes = ['be', 'ge', 'her', 'on', 'ont', 'tegen', 'ver', 'aarts', 'opper', 'super', 'hyper', 'ultra', 'wan', 'vice', 'vice-', 'sub', 'anti', 'pro', 'ex', 'ex-', 
                'oud-', 'oud', 'niet-', 'niet', 'non-', 'non', 'her', 'weder', 're-', 'oer', 'pre', 'pre-', 'post', 'post-' 'inter', 'auto', 'neo', 'neo-', 'pan', 'pseudo', 
                'pan-', 'pseudo-', 'pseudo', 'anti-']
    suffixes = ['aar', 'eur', 'achtig', 'es', 'aard', 'erd', 'heid', 'ig', 'erig', 'ij', 'in', 'ing', 'je', 'tje', 'lijk', 'schap', 'sel', 'te', 'teit',
               'tie', 'tor', 'trix', 'ette', 'trice', 's', 'e', 'schap', 'en']
    

    if non_words:

        segmentations_extra = {}

        for word in df:
            for suffix in suffixes:
                segmentations_extra[word + suffix] = df[word] + [suffix]
        
        for word in df:
            for prefix in prefixes:
                segmentations_extra[prefix + word] = [prefix] + df[word] 
    
    else:

        segmentations_extra = {}

        for word in df:
            for suffix in suffixes:
                if word + suffix in word_freqs:
                    segmentations_extra[word + suffix] = df[word] + [suffix]

        for word in df:
            for prefix in prefixes:
                if prefix + word in word_freqs:
                    segmentations_extra[prefix + word] = [prefix] + df[word] 

        
        for word in df:
            for suffix in suffix:
                for prefix in prefix:
                    if prefix + word + suffix in word_freqs:
                        segmentations_extra[prefix + word + suffix] = [prefix] + df[word] + [suffix]
        
        for word in df:
            for suffix in suffixes:
                for suffix2 in suffixes:
                    if word + suffix + suffix2 in word_freqs:
                        segmentations_extra[word + suffix + suffix2] = df[word] + [suffix] + [suffix2]
        

        for word in df:
            for prefix in prefixes:
                for prefix2 in suffixes:
                    if prefix + prefix2 + word in word_freqs:
                        segmentations_extra[prefix + prefix2 +word] = [prefix] + [prefix2] + df[word]

        # for word in df:
        #     for suffix in suffixes:
        #         for suffix2 in suffixes:
        #             for prefix in prefixes:
        #                 if prefix + word + suffix + suffix2 in word_freqs:
        #                     segmentations_extra[prefix + word + suffix + suffix2] = [prefix] + df[word] + [suffix] + [suffix2]

        # for word in df:
        #     for prefix in prefixes:
        #         for prefix2 in suffixes:
        #             for suffix in suffixes:
        #                 if prefix + prefix2 + word + suffix in word_freqs:
        #                     segmentations_extra[prefix + prefix2 + word + suffix] = [prefix] + [prefix2] + df[word] + [suffix]

        


        

    # else:

    #     related_words = {word: rels for word, rels in groups.items() if word in df}

    #     segmentations_extra = {}

    #     for word in df:
    #         if word in related_words:
    #             for rel in related_words[word]:
    #                 for suffix in suffixes:
    #                     if rel in df:
    #                         if len(df[rel]) < n_min and word + suffix == rel:  # OR if word + suffix == rel:
    #                             segmentations_extra[word + suffix] = df[word] + [suffix]
    #                     else:
    #                         segmentations_extra[word + suffix] = df[word] + [suffix]

        
    #     for word in df:
    #         if word in related_words:
    #             for rel in related_words[word]:
    #                 for prefix in prefixes:
    #                     if rel in df:
    #                         if len(df[rel]) < n_min and word + prefix == rel: # OR if word + suffix == rel:
    #                             segmentations_extra[prefix + word] = [prefix] + df[word] 
    #                     else:
    #                         segmentations_extra[prefix + word] = [prefix] + df[word] 
  
    
    return segmentations_extra





def add_compounds_(df, word_freqs, replace=False):


    extra = {}



    for word in df:
        for word2 in df:
            if word + word2 in word_freqs:
                extra[word + word2] = df[word] + df[word2]
    
    if replace:
        return df | extra

    else: 
        return extra | df






def merge_dictionaries_verb(dic1, dic2, replace=True, n_min=2):
    
    out = {word: segs for word, segs in dic1.items()}
    words_not_to_be_replaced = ['beurt', 'buit', 'dorst', 'geit', 'luid', 'geluid', 'pracht', 'rijt', 'ruit', 'geruit', 'sliert', 'spijt', 'spuit', 'tuit', 'vlijt', 'vorst']
    
    if replace:
        for word, segs in dic2.items():
            if not word in words_not_to_be_replaced:
                out[word] = segs
    else:
        for word, segs in dic2.items():
            if word in out:
                pass
            else:
                out[word] = segs
    
    return out




def merge_dictionaries_nouns(dic1, dic2, replace=True, n_min=2):
    
    out = {word: segs for word, segs in dic1.items()}
    
    if replace:
        for word, segs in dic2.items():
                out[word] = segs
    else:
        for word, segs in dic2.items():
            if word in out:
                if len(segs) >= 2 and segs[0] == 'on' and segs[1] == 'ge':
                    out[word] = segs
            else:
                out[word] = segs
    
    return out



def count_morphemes(df):

    total = set([])
    out = set([])
    for word, segs in df.items():
        out.add(segs[0])
        for seg in segs:
            total.add(seg)
    print(f'There are {len(total)} morphemes, out of these {len(out)} appear at the begining of a word')




def remove_words_not_in_corpus(df, word_freqs, treshold=0):

    out = {}

    if treshold > 0:
        for word, segs in df.items():
            if word in word_freqs:
                if word_freqs[word] > treshold:
                    out[word] = segs

    else:

        for word, segs in df.items():
            if word in word_freqs:
                out[word] = segs
    
    return out









####################################################################

###  Complete function to create a dictionary from the database  ###

####################################################################


def create_segmentation_dictionary(segmentation_data, word_family_data, word_freqs, extra_loop=True, add_morphemes=True, 
                                   add_empty=False, add_plurals=True, replace_non_identical=False, add_verbs=True, greedy_verb=False,
                                   add_nouns=True, greedy_noun=False, replace_verbs=True, replace_nouns=False, min_n_segments=1, 
                                   add_compounds=True, replace_compounds=False, remove_ortho=True, remove_not_in_corpus=False, meta_data=False, print_info=True):
    
    # create base dictionary of dictionaries for every word in the dataset
    base = create_initial_dataframe(segmentation_data)
    stats = {}

    # create initial segmentation dictionary
    dic = create_segmentations_from_base(base)

    # add basic verbs
    dic = add_basic_verbs(dic, base)

    # print stats
    n0 = len([word for word, segs in dic.items() if len(segs) == 0])
    n1 = len([word for word, segs in dic.items() if len(segs) == 1])
    n2 = len([word for word, segs in dic.items() if len(segs) > 1])
    stats['size_0'] = n0
    stats['size_1'] = n1
    stats['size_2+'] = n2
    if print_info:
        print(f'''There are {len(dic)} entries in the database. Out of these:
          - {n0} words have no segmentations
          - {n1} words have a single morpheme as segmentation 
          - {n2} words are split up into multiple morphemes\n''')


    # add possible further segmentations of segments
    d = copy.deepcopy(dic)
    if extra_loop:
        dic = create_segmentations_extra_loop(dic)
    
        # print stats
        loop_changes = [word for word in dic if not dic[word] == d[word] or not word in d]
        n3 = len(loop_changes)
        stats['loop changes'] = loop_changes
        stats['number of loop changes'] = n3
        if print_info:
            print(f'- Including the extra loop has increased the number of morphemes for {n3} words [total size = {len(dic)}]')
    
    
    
    # add every morpheme in the segmentations as entry to the dictionary
    d = copy.deepcopy(dic)
    if add_morphemes:
        dic = add_morphemes_to_dict(dic)

        # print stats
        morphs_added = set(dic) - set(d)
        n4 = len(morphs_added)
        stats['morphemes added'] = morphs_added
        stats['number of morphs added'] = n4
        if print_info:
            print(f'- {n4} Morphemes were added as entry to the dictionary [total size = {len(dic)}]')
    
    
    
    # add every word with no segmentation to the dictionary with itself as value
    d = copy.deepcopy(dic)
    if add_empty:
        dic = add_empty_segmentations(dic)
        
        # print stats
        added_words = set([word for word, seg in d.items() if len(seg) == 0]) - set([word for word, seg in dic.items() if len(seg) == 0])
        n5 = len(added_words)
        stats['identical words added'] = list(added_words)
        stats['number of idenitical words added'] = n5
        if print_info:
            print(f'- By choosing to add the words with no segmentation to the dictionary, {n5} segmentations were added [total size = {len(dic)}]')
    

    # add plurals
    d = copy.deepcopy(dic)
    if add_plurals:
        dic = add_plurals_(dic)

    
    # replace the single morphemes that are not identical with the word with the word
    d = copy.deepcopy(dic)
    stats['non identical words'] = [word for word, seg in d.items() if len(seg) == 1 and not word == seg[0]]
    if replace_non_identical:
        dic = replace_non_identical_morphs(dic)
        words_replaced = set([word for word, seg in d.items() if len(seg) == 1 and not word == seg[0]]) - set([word for word, seg in dic.items() if len(seg) == 1 and not word == seg[0]])
        n6 = len(words_replaced)
        stats['number of non identical words replaced'] = n6
        
        # print stats
        if print_info:
            print(f'- {n6} Non-identical single morph words were replaced with the identical word [total size = {len(dic)}]')
    

    # load data with word families
    word_fams = create_word_fams(word_family_data)
    word_groups = group_keys_by_value(word_fams)


    # add verb conjugation
    d = copy.deepcopy(dic)
    if add_verbs:
        if greedy_verb:
            extra = create_verb_segmentations(base, word_groups, True)
            if replace_verbs:
                dic = merge_dictionaries_verb(dic, extra, replace=True, n_min=2)
                # if remove_not_in_corpus:
                #     dic = remove_words_not_in_corpus(dic, word_freqs)
            else:
                dic = merge_dictionaries_verb(dic, extra, replace=False)
                # if remove_not_in_corpus:
                #     dic = remove_words_not_in_corpus(dic, word_freqs)
            

        else:
            extra = create_verb_segmentations(base, word_groups, False)
            if replace_verbs:
                dic = merge_dictionaries_verb(dic, extra, replace=True, n_min=2)
                # if remove_not_in_corpus:
                #     dic = remove_words_not_in_corpus(dic, word_freqs)
            else:
                dic = merge_dictionaries_verb(dic, extra, replace=False)
                # if remove_not_in_corpus:
                #     dic = remove_words_not_in_corpus(dic, word_freqs)
            
        
        # print stats
        verb_additions = set(dic) - set(d)
        n7 = len(verb_additions)
        stats['verb conjugates added'] = verb_additions
        stats['number of verb conjugates added'] = n7
        if print_info:
            if greedy_verb:
                print(f'- By choosing to add verbs with a greedy approach, {n7} verb conjugates were added to the dictionary [total size = {len(dic)}]')
            else:
                print(f'- By choosing to add verbs with a non-greedy approach, {n7} verb conjugates were added to the dictionary [total size = {len(dic)}]')

    
    # add noun conjugation
    d = copy.deepcopy(dic)
    if add_nouns:
        if greedy_noun:
            extra = create_noun_segmentations(dic, word_groups, word_freqs, True)
            if replace_nouns:
                dic = merge_dictionaries_nouns(dic, extra, replace=True, n_min=2)
            else:
                dic = merge_dictionaries_nouns(dic, extra, replace=False)
        else:
            extra = create_noun_segmentations(dic, word_groups, word_freqs, False)
            if replace_nouns:
                dic = merge_dictionaries_nouns(dic, extra, replace=True, n_min=2)
            else:
                dic = merge_dictionaries_nouns(dic, extra, replace=False)
        
        # print stats
        noun_additions = set(dic) - set(d)
        n8 = len(noun_additions)
        stats['noun conjugates added'] = noun_additions
        stats['number of noun conjugates added'] = n8
        if print_info:
            if greedy_noun:
                print(f'- By choosing to add nouns with a greedy approach, {n8} noun conjugates were added to the dictionary [total size = {len(dic)}]')
            else:
                print(f'- By choosing to add nouns with a non-greedy approach, {n8} noun conjugates were added to the dictionary [total size = {len(dic)}]')
    


    # remove the enties that have less than n morphemes (standard: delete empty segmentations)
    d = copy.deepcopy(dic)
    dic = {word: segs for word, segs in dic.items() if len(segs) >= min_n_segments}
    words_deleted = set(d) - set(dic)
    n9 = len(words_deleted)
    stats['empty words deleted'] = words_deleted
    stats['number of empty words deleted'] = n9
    if print_info:
        print(f'- By choosing to remove the words with less than {min_n_segments} morpheme(s), {n9} words were deleted [total size = {len(dic)}]')


    # remove words with orthograpic changes in the segmentation
    if remove_ortho:
        d = copy.deepcopy(dic)
        dic = remove_ortho_changes(dic)
        
        # print stats 
        removed = set(d) - set(dic)
        n10 = len(removed)
        stats['ortho words deleted'] = removed
        stats['number of ortho words deleted'] = n10
        if print_info:
            print(f'- By choosing to remove the words where a change in spelling in the segmentations occurs, {n10} words were removed [total size = {len(dic)}]')
    


    # remove words that do not occur in a corpus
    if remove_not_in_corpus:
        dic = remove_words_not_in_corpus(dic, word_freqs)

    # add compounds
    if add_compounds:
        if replace_compounds:
            dic = add_compounds_(dic, word_freqs, True)
        else:
            dic = add_compounds_(dic, word_freqs, False)


    # return just the dictonary, or add meta-data
    if meta_data:
        return dic, stats
    else:
        return dic


    
    
    

In [41]:
a = {1: 10, 2: 20}
b = {2: 30, 3: 40}

a | b

#b | a

{1: 10, 2: 30, 3: 40}

### Functions to compare dictionaries

In order to evaluate the tokenizers, we need the following elements:
- the simlex words
- a corpus in the form of a generator
- a corpus in the form of a frequency dictionary

We will create various different versions of these three elements


In [6]:
import string
from tqdm import tqdm
from transformers import AutoTokenizer
import pandas as pd


def preprocess_basic(seq):
    return [s.strip(string.punctuation) for s in seq.strip().split()]

def preprocess_lower(seq):
    return [s.strip(string.punctuation) for s in seq.strip().lower().split()]

def preprocess_bpe(seq, tokenizer):
    return [word.replace('Ġ', '') for word in [word for word, offset in tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(seq)]]


# this function counts the number of times words in a dataframe occur in a corpus, taking the corpus as generator
def count_frequency_from_generator(words, corpus_generator):

    words_in = 0
    words_out = 0

    for i in corpus_generator:
        text = preprocess_lower(i) # andere preproces mogelijk
        for word in text:
            if word in words:
                words_in += 1
            else:
                words_out += 1
    
    print(f'{round(100 * words_in / (words_in + words_out), 1)}% of words in the corpus are in the celex database')
    print(f'{round(100 * words_out / (words_in + words_out), 1)}% of words are not')

    return (words_in, words_out)





# this function creates a word frequency dictionary for a corpus, so that we can count faster later on
def create_word_freqs_from_corpus(corpus_generator, sorted=False):
    
    word_freqs = {}

    for i in corpus_generator:     
        text = preprocess_basic(i)
        for word in text:
            if word in word_freqs:
                word_freqs[word] += 1
            else:
                word_freqs[word] = 1
    
    if sorted:
        word_freqs = dict(sorted(word_freqs.items(), key=lambda item: item[1], reverse=True))
    
    return word_freqs



# this function creates a word frequency dictionary for a corpus, so that we can count faster later on
# a progress bar was added
def create_word_freqs_from_local_corpus(corpus_generator, sort=False, progress=True, path=0, prep_lower=False):
    
    if progress:
        print(f'Estimating the size of the dataset ...\n')
        if path == 0:
            assert path == 1, 'Enter path to get progress bar, or set progress=False to perform the function without one'
        else:
            size = get_size_for_local(path)
    
        word_freqs = {}
        
        for i in range(1):
            if i == 0:
                print(f'Data size: {format_with_dots(size)} lines of text! Generating the frequency dictionary ...')

        if prep_lower:

            for i in tqdm(corpus_generator, total=size, desc="Progress", unit=" iterations"):     
                text = preprocess_lower(i)
                for word in text:
                    if word in word_freqs:
                        word_freqs[word] += 1
                    else:
                        word_freqs[word] = 1
        
        else:

            for i in tqdm(corpus_generator, total=size, desc="Progress", unit=" iterations"):     
                text = preprocess_basic(i)
                print(text)
                for word in text:
                    if word in word_freqs:
                        word_freqs[word] += 1
                    else:
                        word_freqs[word] = 1
    
    else:

        print('Performing task without progress bar')

        word_freqs = {}

        for i in corpus_generator:     
            text = preprocess_basic(i)
            for word in text:
                if word in word_freqs:
                    word_freqs[word] += 1
                else:
                    word_freqs[word] = 1
        
    if sort:
        word_freqs = dict(sorted(word_freqs.items(), key=lambda item: item[1], reverse=True)) 
    
    return word_freqs


# this function creates a frequency dictionary for a corpus loaded with a generator, especially usefull for streaming from huggingface server
def create_word_freqs_from_online_corpus(corpus_generator, sorted=False, progress=True, avg_size=6750000, n_files=45):

    size = n_files * avg_size
    
    if progress:
        print(f'The estimated size of the entire corpus is around {format_with_dots(size)} lines of text!')
        
        word_freqs = {}

        for i in range(1):
            if i == 0:
                print(f'Generating the frequency dictionary ...\n')

        for i in tqdm(corpus_generator, total=size, desc="Progress", unit=" iterations"):     
            text = preprocess_basic(i['text'])
            for word in text:
                if word in word_freqs:
                    word_freqs[word] += 1
                else:
                    word_freqs[word] = 1
    
    return word_freqs



def get_size_for_local(path):
    with open(path) as osc:
        n = 0
        for line in osc:
            n+= 1
        return n 

def format_with_dots(number):
    return f"{number:,}".replace(",", ".")


# this function counts how many of the words in a dataframe occur at least once in the corpus, which is printed
# it returns two lists, one with the words in the corpus, one with the words not in the corpus
# update: now also returns frequencies
def dataset_in_corpus(df, word_freqs, lower=False, print_info=True):

    words_in = {}
    words_out = []

    if lower:

        word_freqs_lower = {key.lower(): value for key, value in word_freqs.items()}

        for word in df:
            if word in word_freqs or word in word_freqs_lower:
                if word in words_in:
                    if word in word_freqs:
                        words_in[word] += word_freqs[word]
                    else:
                        words_in[word] += word_freqs_lower[word]
                else:
                    if word in word_freqs:
                        words_in[word] = word_freqs[word]
                    else:
                        words_in[word] = word_freqs_lower[word]

            else:
                words_out.append(word)
    
    else:
    
        for word in df:
            if word in word_freqs:
                if word in words_in:
                    words_in[word] += word_freqs[word]
                else:
                    words_in[word] = word_freqs[word]
            
            else:
                words_out.append(word)

    
    n_in = len(words_in)
    n_out = len(words_out)
    
    if print_info:
        print(f'{round(100 * n_in / (n_in + n_out), 1)}% of words in the dataset are in the corpus')
        print(f'{round(100 * n_out / (n_in + n_out), 1)}% of words are not\n')

    return words_in, words_out


# this function calculates the number of words in the corpus that are in a dataframe. 
# two prints are made: one for every word in the corpus that is in the dataframe, one for every unique word in the corpus that is in the dataframe
# the function returns four dictionaries (in/not in dataframe - all/unique)
def corpus_in_dataset(df, word_freqs, lower=False, print_info=True):

    n_in = 0
    n_out = 0

    n_in_abs = 0
    n_out_abs = 0

    words_in = []
    words_out = []

    if lower:

        for word in word_freqs:
            if word in df or word.lower() in df:
                n_in += word_freqs[word]
                n_in_abs += 1
                words_in.append(word)
            else:
                n_out += word_freqs[word]
                n_out_abs += 1
                words_out.append(word)
    
    else:

        for word in word_freqs:
            if word in df:
                n_in += word_freqs[word]
                n_in_abs += 1
                words_in.append(word)
            else:
                n_out += word_freqs[word]
                n_out_abs += 1
                words_out.append(word)
    
    if print_info:
        print(f'{round(100 * n_in / (n_in + n_out), 1)}% of words in the corpus are in the dataset')
        print(f'{round(100 * n_out / (n_in + n_out), 1)}% of words are not\n')

        print(f'{round(100 * n_in_abs / (n_in_abs + n_out_abs), 1)}% of unique words in the corpus are in the dataset')
        print(f'{round(100 * n_out_abs / (n_in_abs + n_out_abs), 1)}% of unique words are not\n')

    return words_in, words_out, n_in, n_out, n_in_abs, n_out_abs







# this function uses the previous two functions to give a 'complete' picture of the relationship between a dataframe and the corpus
def compare_dataset_and_corpus(df, word_freqs, lower_=False, print_info_=True):

    n_corpus = sum(word_freqs.values())
    n_dataset = len(df)

    if print_info_:
        print(f'There are {n_corpus} words in the corpus')
        print(f'There are {n_dataset} words in the dataset\n')

    a, b, c, d, e, f = corpus_in_dataset(df, word_freqs, lower=lower_, print_info=print_info_)
    g, h = dataset_in_corpus(df, word_freqs, lower=lower_, print_info=print_info_)
    
    #return g, b, f
    return {'in both': g, 'not in dataset': b, 'not in corpus': h, 'n in both': c, 'n not in dataset': d, 'n in corpus': len(g), 'n not in corpus': len(h)}







# this function is not finished yet
# def compare_segmentations_and_corpus(df, word_freqs, lower=False):

#     a = {word for word, segments in df.items() if len(segments) == 0}
#     b = {seg for segments in df.values() for seg in segments if len(segments) == 1}
#     c = {seg for segments in df.values() for seg in segments if len(segments) > 1}

#     tokens_max = a | b | c
#     tokens_min = c
#     tokens_x = b | c



# this function calculates the percentage of words that have no orthographic changes in the segmentation that have at least two moprhemes
# it does this in relation to a corpus,  so based on the number of times it occurs in the corpus
def count_ortho(df, word_freqs, lower=False):

    segmentations = {word: segments for word, segments in df.items() if len(segments) >= 2}

    seg = {word for word, segments in segmentations.items() if ''.join(segments) == word}
    no_seg = {word for word, segments in segmentations.items() if ''.join(segments) != word}

    n_in = 0
    n_out = 0

    n_total = sum(word_freqs.values())

    if lower:

        for word in word_freqs:
            if word in seg or word.lower() in seg:
                n_in += word_freqs[word]
            elif word in no_seg or word.lower() in no_seg:
                n_out += word_freqs[word]
    
    else:

        for word in word_freqs:
            if word in seg:
                n_in += word_freqs[word]
            elif word in no_seg:
                n_out += word_freqs[word]


    print(f'{round(100 * n_in / (n_in + n_out), 1)}% of segmentable words in the corpus are segmented via the dict')
    print(f'This is equal to {round(100 * n_in / n_total, 1)}% of words in the corpus \n')



def sort_dictionary(df, descending=True):

    if descending:
         sorted_dict = dict(sorted(df.items(), key=lambda item: item[1], reverse=True))
    else:
        sorted_dict = dict(sorted(df.items(), key=lambda item: item[1]))
    
    return sorted_dict



def combine_dictionaries(*dictionaries):
    out = {}
    for dictionary in dictionaries:
        for word, freq in dictionary.items():
            if word in out:
                out[word] += freq
            else:
                out[word] = freq
    return out




def compare_multiple_dicts_and_copora(frequency_dictionaries, datasets):
    
    results1 = {}
    results2 = {}
    
    for i, freq in enumerate(frequency_dictionaries):
        x = f'Corpus {i+1}'
        results1[x] = {}
        results2[x] = {}


        for j, data in enumerate(datasets):


            y = f'Segmentations {j+1}'
    

            result = compare_dataset_and_corpus(data, freq, print_info_=False)
            in_both = result['in both']
            not_in_data = result['not in dataset']
            not_in_corpus = result['not in corpus']
            n_in_both = result['n in both']
            n_not_in_data = result['n not in dataset']
            n_in_corpus = result['n in corpus']
            n_not_in_corpus = result['n not in corpus']


            a = round(100 * n_in_both / (n_in_both + n_not_in_data), 1)
            b = round(100 * n_in_corpus / (n_in_corpus + n_not_in_corpus), 1)

            results1[x][y] = a
            results2[x][y] = b
    
    df1 = pd.DataFrame(results1) #.T
    df2 = pd.DataFrame(results2) #.T
            

    print(df1)
    print('\n')
    print(df2)


def compare_multiple_corpora(frequency_dictionaries, dataset):
    
    results1 = {}

    
    for i, freq in enumerate(frequency_dictionaries):
        x = f'Corpus {i+1}'
        results1[x] = {}

        result = compare_dataset_and_corpus(dataset, freq, print_info_=False)
        in_both = result['in both']
        not_in_data = result['not in dataset']
        not_in_corpus = result['not in corpus']
        n_in_both = result['n in both']
        n_not_in_data = result['n not in dataset']
        n_in_corpus = result['n in corpus']
        n_not_in_corpus = result['n not in corpus']


        a = round(100 * n_in_both / (n_in_both + n_not_in_data), 1)
        b = round(100 * n_in_corpus / (n_in_corpus + n_not_in_corpus), 1)

        results1[x]['A'] = a
        results1[x]['B'] = b

    
    df1 = pd.DataFrame(results1) #.T  

    return df1



def compare_multiple_dicts(frequency_dictionary, datasets):
    
    results1 = {}

    
    for i, dataset in enumerate(datasets):
        x = f'Segmentations dictionary {i+1}'
        results1[x] = {}

        result = compare_dataset_and_corpus(dataset, frequency_dictionary, print_info_=False)
        in_both = result['in both']
        not_in_data = result['not in dataset']
        not_in_corpus = result['not in corpus']
        n_in_both = result['n in both']
        n_not_in_data = result['n not in dataset']
        n_in_corpus = result['n in corpus']
        n_not_in_corpus = result['n not in corpus']


        a = round(100 * n_in_both / (n_in_both + n_not_in_data), 1)
        b = round(100 * n_in_corpus / (n_in_corpus + n_not_in_corpus), 1)

        results1[x]['Corpus words in segmentations dict (%)'] = a
        results1[x]['Segmentations dict words in corpus (%)'] = b

    
    df1 = pd.DataFrame(results1).T  

    return df1


def compare_only_dictionaries(dict1, dict2, return_dict=False):

    diff = []
    not_in_2 = []
    not_in_1 = []
    
    for word, segs in dict1.items():
        if word in dict2:
            if not segs == dict2[word]:
                diff.append(word)
        else:
            not_in_2.append(word)
    
    for word, segs in dict2.items():
        if word in dict1:
            if not segs == dict1[word]:
                diff.append(word)
        else:
            not_in_1.append(word)

    dic = {word: (dict1[word], dict2[word]) for word in diff}

    if return_dict:
        return dic, not_in_2, not_in_1
    else:
        return diff, not_in_2, not_in_1




def compare_sets(a, b, print_info=True):

    incl = []
    excl = []

    for word in a:
        if word in b:
            incl.append(word)
        else:
            excl.append(word)

    p_in = 100 * len(incl) / len(a)
    
    if print_info:
        print(f'{round(p_in, 2)}% of words in set a are in set b')
    
    return incl, excl


def compare_dict_simlex(sim, dic, print_info=True):

    incl = []
    excl = []
    mult = 0

    for word in sim:
        if word in dic:
            incl.append(word)
            if len(dic[word]) > 1:
                mult += 1
        else:
            excl.append(word)

    p_in = 100 * len(incl) / len(sim)
    p_mult = 100 * mult / len(sim)
    
    if print_info:
        print(f'{round(p_in, 2)}% of words in simlex are in the dictionary')
        print(f'{round(p_mult, 2)}% of words in simlex are in the dictionary and have more than one segment')






def return_number_of_morphemes(dic):
    out = set([])
    for word, segs in dic.items():
        for seg in segs:
            out.add(seg)
    return len(out)


def count_morphemes_extra(df, print_info=True):

    total = set([])
    begin = set([])
    mid = set([])

    for word, segs in df.items():
        begin.add(segs[0])
        if len(segs) > 1:
           for item in segs[1:]:
               mid.add(item)
        for seg in segs:
            total.add(seg)

    not_at_begin = mid - begin

    size = len(total) + len(begin) - len(not_at_begin)

    if print_info:
    
        print(f'There are {len(total)} morphemes, out of these {len(begin)} appear at the begining of a word')
        print(f'{len(not_at_begin)} morphemes do not appear at the beginning. This means that the vocabulary has a size of {size} tokens (only lowercase).')
    
    else:
        return size



def return_morphemes(df, form='list'):

    out = set([])
    for word, segs in df.items():
        for seg in segs:
            out.add(seg)
    
    if form == 'list':
        return list(out)
    else:
        return out







### Creating word frequency dictionaries for the corpus

In [7]:
# dictionary for first 4 parts of OSCAR
word_freqs_4 = load_json('word_freqs_all_lower.json')

# dictionary for first 20 parts of OSCAR
word_freqs_20 = load_json('/Users/jan/Documents/Master/Thesis/Code/Snellius/Outputs/frequencies20.json')

### Comparison of different dictionaries

We will now create some different versions of the segmentation dictionory. We can do this with one functions, which has a lot of boolean parameters to specify our choices. 



The possible choices we have are:
- extra_loop: chech whether a morpheme of a word is in the dictionary with a segmentation itself. If so, replace with this extra segmentation.
- add_morphemes: add the morphemes that are not in the dictionary as key to the dictionary, with as segmentation the identical morpheme
- add_empty: for the words with no segmentation, add the identical word as the segmentation of the word. 
- replace_non_identical: for words that have a single morpheme as segmentation that is not identical to the word, replace the morpheme with the identical word
- add_verbs: add the conjugation of every verb
- greedy_verb: do this in a way that also non-existent words will enter the dictionary
- add_nouns: add the conjugation of every noun
- greedy_noun: do this in a way that also non-existent words will enter the dictionary
- replace_conjugates: creating these inflections of words can lead to words that are already in the dictionary. Choose whether we want to replace them in this case. 
- min_n_segments: return only the words with at least n morphemes as segmentation. Automatically set to n = 1.
- remove_ortho: remove all words for which the concatenated morphemes are not identical to the word
- meta_data: return an additional dictionary with meta data about the dictionary
- print_info: print information related to (the creation of) the dictionary

In [9]:
segmentation_data = celex
word_family_data = celex2

word_fams = create_word_fams(word_family_data)
word_groups = group_keys_by_value(word_fams)

Effect of extra loop

Effect of adding morphemes

Effect of adding words with emppty segmentation

Effect of replacing words with one mopheme that is not identical

Effect of adding verbs

Effect of adding nouns

In [34]:



# no_nouns = create_segmentation_dictionary(segmentation_data, word_family_data, word_freqs_20, extra_loop=True, add_morphemes=True, 
#                                    add_empty=False, add_plurals=True, replace_non_identical=False, add_verbs=True, greedy_verb=True,
#                                    add_nouns=False, greedy_noun=False, replace_verbs=True, replace_nouns=False, min_n_segments=1, 
#                                    remove_ortho=True, remove_not_in_corpus=True, meta_data=False, print_info=False)

# with_nouns = create_segmentation_dictionary(segmentation_data, word_family_data, word_freqs_20, extra_loop=True, add_morphemes=True, 
#                                    add_empty=False, add_plurals=True, replace_non_identical=False, add_verbs=True, greedy_verb=True,
#                                    add_nouns=True, greedy_noun=False, replace_verbs=True, replace_nouns=False, min_n_segments=1, 
#                                    remove_ortho=True, remove_not_in_corpus=True, meta_data=False, print_info=False)


In [35]:
len(no_nouns)

87854

In [1]:
len(with_nouns)

NameError: name 'with_nouns' is not defined

Effect of replacing words with these newly formed words

In [37]:
# with_replace = create_segmentation_dictionary(segmentation_data, word_family_data, word_freqs_20, extra_loop=True, add_morphemes=True, 
#                                    add_empty=False, add_plurals=True, replace_non_identical=False, add_verbs=True, greedy_verb=True,
#                                    add_nouns=True, greedy_noun=False, replace_verbs=True, replace_nouns=True, min_n_segments=1, 
#                                    remove_ortho=True, remove_not_in_corpus=True, meta_data=False, print_info=False)


# no_replace = create_segmentation_dictionary(segmentation_data, word_family_data, word_freqs_20, extra_loop=True, add_morphemes=True, 
#                                    add_empty=False, add_plurals=True, replace_non_identical=False, add_verbs=True, greedy_verb=True,
#                                    add_nouns=True, greedy_noun=False, replace_verbs=True, replace_nouns=False, min_n_segments=1, 
#                                    remove_ortho=True, remove_not_in_corpus=True, meta_data=False, print_info=False)


In [40]:
diff = compare_only_dictionaries(no_replace, with_replace, return_dict=True)


Effect of replacing compunds

In [47]:
# comps_no_replace  = create_segmentation_dictionary(segmentation_data, word_family_data, word_freqs_20, extra_loop=True, add_morphemes=True, 
#                                    add_empty=False, add_plurals=True, replace_non_identical=False, add_verbs=True, greedy_verb=True,
#                                    add_nouns=True, greedy_noun=False, replace_verbs=True, replace_nouns=False, min_n_segments=1, 
#                                    add_compounds=True, replace_compounds=False, remove_ortho=True, remove_not_in_corpus=True, meta_data=False, print_info=False)

KeyboardInterrupt: 

In [52]:
# comps_replace = create_segmentation_dictionary(segmentation_data, word_family_data, word_freqs_20, extra_loop=True, add_morphemes=True, 
#                                    add_empty=False, add_plurals=True, replace_non_identical=False, add_verbs=True, greedy_verb=True,
#                                    add_nouns=True, greedy_noun=False, replace_verbs=True, replace_nouns=False, min_n_segments=1, 
#                                    add_compounds=True, replace_compounds=True, remove_ortho=True, remove_not_in_corpus=True, meta_data=False, print_info=False)

### Conclusion: the optimal dictionary

The final dictionary will be:

In [72]:
segmentation_dictionary = create_segmentation_dictionary(segmentation_data, word_family_data, word_freqs_20, extra_loop=True, add_morphemes=True, 
                                   add_empty=False, add_plurals=True, replace_non_identical=False, add_verbs=True, greedy_verb=True,
                                   add_nouns=True, greedy_noun=False, replace_verbs=True, replace_nouns=False, min_n_segments=1, 
                                   add_compounds=False, replace_compounds=False, remove_ortho=True, remove_not_in_corpus=True, meta_data=False, print_info=False)

In [81]:
segmentation_dictionary2 = add_compounds_(segmentation_dictionary, word_freqs_20, True)

KeyboardInterrupt: 

In [None]:
store_json('/Users/jan/Documents/Master/Thesis/Code/seg_dict_with_compounds.json', segmentation_dictionary2)

Let's save the dictionary in a json file:

In [73]:
store_json('/Users/jan/Documents/Master/Thesis/Code/seg_dict.json', segmentation_dictionary)

Test if there are words for which the segmentations do not add up to the the identical word:

In [74]:
d = segmentation_dictionary

for word in d:
    concat = ''.join(d[word])
    if word != concat:
        print(word, d[word])


Comparisons with the words in Dutch SimLex:

In [75]:
compare_dict_simlex(simlex_words, segmentation_dictionary)

76.65% of words in simlex are in the dictionary
19.71% of words in simlex are in the dictionary and have more than one segment


In [76]:
compare_dict_simlex(simlex_words, sd)

76.65% of words in simlex are in the dictionary
19.71% of words in simlex are in the dictionary and have more than one segment


Comparison with the text corpus:

In [77]:
a = corpus_in_dataset(segmentation_dictionary, word_freqs_20)

72.6% of words in the corpus are in the dataset
27.4% of words are not

1.3% of unique words in the corpus are in the dataset
98.7% of unique words are not



In [78]:
a = corpus_in_dataset(sd, word_freqs_20)

72.6% of words in the corpus are in the dataset
27.4% of words are not

1.3% of unique words in the corpus are in the dataset
98.7% of unique words are not



Count the number of tokens our vocabulary will have from this segmentation dictionary:

In [79]:
count_morphemes_extra(segmentation_dictionary)

There are 13602 morphemes, out of these 13299 appear at the begining of a word
303 morphemes do not appear at the beginning. This means that the vocabulary has a size of 26598 tokens (only lowercase).


## Tokenizer

Three parts:
- creating the vocabulary
- tokenizing text
- detokenizing tokens

### Creating the vocabulary




The vocabulary is formed from two sources:
- dataset with morphemes
- BPE

(In reality we there is not one vocabulary created)

#### Morphemes

We first create the set of morphemes, which we take from the segmentation dictionary.

In [31]:
seg_dict = segmentation_dictionary

# OR load from disk:
# seg_dict = 

morphemes = return_morphemes(seg_dict)

print(f'There are {len(morphemes)} in the dictionary. Due to a difference between the start and not-start of a word, this means {count_morphemes_extra(seg_dict, print_info=False)} tokens in the vocabulary.')

There are 13609 in the dictionary. Due to a difference between the start and not-start of a word, this means 26626 tokens in the vocabulary.


#### BPE

Based on the number of morphemes and the desired size of the vocabulary, we can train a BPE algorithm to tokenize the words that are not in our dictionary.

In order to do this we must preprocess the data in a way that takes the words out, and that converts it lowercase. 

The best way in our case is to create new text files, so we define a function that returens the converted text file.

In [106]:
path = '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short/OSCAR_short.txt'

In [171]:
from transformers import AutoTokenizer
import datasets
from datasets import load_dataset

tokenizer_x = AutoTokenizer.from_pretrained("gpt2")
segmentation_dictionary = segmentation_dictionary

def lowercase_text(item):
    return item.lower()

def remove_words(item):
    text = tokenizer_x._tokenizer.pre_tokenizer.pre_tokenize_str(item)
    words_a = [word for word, offset in text]
    words_b = [word.replace('Ġ', '') for word in words_a]
    out = []
    for i, word in enumerate(words_b):
        if not word in segmentation_dictionary:
            out.append(words_a[i])
    return tokenizer_x.convert_tokens_to_string(out)

def convert_text(input):
    with open(input) as inp:
        with open(input[:-4] + '_converted.txt', 'w') as outp:
            for line in inp:
                line = lowercase_text(line)
                line = remove_words(line)
                outp.write(line)



Let's apply this function to one file here:

In [173]:
convert_text('/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short/OSCAR_short.txt')

Now that we have the text files to train BPE with, we can do this for a desired vocabulary size. 

This is the code to do it locally:

In [152]:
from tokenizers import ByteLevelBPETokenizer
import datasets

paths = '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short/OSCAR_short_converted.txt'

# set size
desired_vocab_size = 35000
n_extra_tokens = desired_vocab_size - count_morphemes_extra(segmentation_dictionary, print_info=False)


# Initialize a tokenizer
tokenizer_30 = ByteLevelBPETokenizer()

# Customize training
tokenizer_30.train(files=paths, vocab_size=n_extra_tokens, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])


# save
tokenizer_30.save_model('/home/scur2141/tokenizer_test/t30')








Or

In [251]:
paths = '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short/OSCAR_short_converted.txt'

# set size
desired_vocab_size = 35000
n_extra_tokens = desired_vocab_size - count_morphemes_extra(segmentation_dictionary, print_info=False)


def create_text_generator(gen):
    for i in gen:
        yield i['text']


path = '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short/OSCAR_short.txt'
oscar_short = load_dataset('text', data_files={"train": path}, split='train')
oscar_short_it = load_dataset('text', data_files={"train": path}, split='train', streaming=True)

dataset = create_text_generator(oscar_short_it)


# load an existing BPE tokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

new_tokenizer = old_tokenizer.train_new_from_iterator(dataset, n_extra_tokens)






But we have done this on the Snellus computer. So we will load them here:


In [276]:
from transformers import RobertaTokenizerFast

t30 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/tokenizeXX/t30", max_len=512)


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Tokenizing text



Now that we have the two components for our tokenizer, we can combine them to form a single tokenizer. 

We will do this in a tokenizer class.

In [32]:
import random




# nog toevoegen:
#    - return_tensors method, die ervoor zorgt dat de input ids enzo als een tensor worden gegeven (of dit al standaard doen) 
# [deze zit in de call method zie voorbeeld: inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
# overigens heb je die andere methods ook nodig (padding & truncation)


## DEZE NIET AANPASSEN ##



# class CustomTokenizerMP:

#     def __init__(self, segmentation_dictionary, bpe_tokenizer):
        
#         self.bpe_tokenizer = bpe_tokenizer
#         self.bpe_vocab = self.bpe_tokenizer.get_vocab()


#         self.segmentations = {word: seg for word, seg in segmentation_dictionary.items() if len(seg) > 0}
#         self.seg_dict = {}
#         for word, segs in self.segmentations.items():
#             out = []
#             for i, seg in enumerate(segs):
#                 if i == 0:
#                     out.append('Ġ' + seg)
#                 else:
#                     out.append(seg)
#             self.seg_dict[word] = out
        
#         self.segments = {seg for segs in self.seg_dict.values() for seg in segs}
#         self.seg_vocab = {seg: (i + len(self.bpe_vocab)) for i, seg in enumerate(self.segments) if not seg in self.bpe_vocab}
        
#         self.vocab = self.bpe_vocab | self.seg_vocab
#         self.inverted_vocab = {value: key for key, value in self.vocab.items()}




#         self.segmentations = {word: seg for word, seg in segmentation_dictionary.items() if len(seg) > 0}
#         self.seg_dict = {}
#         for word, segs in self.segmentations.items():
#             out = []
#             for i, seg in enumerate(segs):
#                 if i == 0:
#                     out.append('Ġ' + seg)
#                 else:
#                     out.append(seg)
#             self.seg_dict[word] = out
        
#         self.segments = {seg for segs in self.seg_dict.values() for seg in segs}
#         self.seg_vocab = {seg: (i + len(self.bpe_vocab)) for i, seg in enumerate(self.segments) if not seg in self.bpe_vocab}
        
#         self.vocab = self.bpe_vocab | self.seg_vocab
#         self.inverted_vocab = {value: key for key, value in self.vocab.items()}


#     def get_vocab(self):
#         return self.vocab # note: we should probably use a getter here, but for now this is ok


#     def encode(self, seq):
#         text = self.bpe_tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(seq)
#         return text
#         # words_a = [word for word in [word for word, offset in text]]
#         # words_b = [word.replace('Ġ', '') for word in [word for word, offset in text]]
#         # #print(f'words_a: {words_a}')
#         # #print(f'words_b: {words_b}')
#         # out = []
#         # for i, word_b in enumerate(words_b):
#         #     if word_b in self.seg_dict:
#         #         out += [self.vocab[seg] for seg in self.seg_dict[word_b]]
#         #     else:
#         #         if words_a[i][0] == 'Ġ':
#         #             out += self.bpe_tokenizer.encode(' ' + word_b)
#         #         else: 
#         #             out += self.bpe_tokenizer.encode(word_b)
#         # #print(f'tokenization: {[self.inverted_vocab[id] for id in out]}')
#         # #print(f'out = {out}')
#         # return out


#     def decode(self, ids: list[int]):
#         assert type(ids) == list
#        # assert type(ids[0]) == int   # dit kan wel netter, volgens mij kan het al met alleen type hints, en dit gaat mis is de list leeg is
#         out = ''
#         for id in ids:
#             word = self.inverted_vocab[id]
#             if word[0] == 'Ġ':
#                 out += word.replace('Ġ', ' ')
#             else:
#                 out += word
#         return out
    
    
#     def tokenize(self, seq):
#         return [self.inverted_vocab[id] for id in self.encode(seq)] 
   

#     def __call__(self, seq):
#         ids = self.encode(seq)
#         #types = [0 for token in ids]
#         attention = [1 for token in ids]
#         #return {'input_ids': ids, 'token_type_ids': types, 'attention_mask': attention}
#         #return {'input_ids': ids}
#         return {'input_ids': ids, 'attention_mask': attention}



## DEZE NIET AANPASSEN ##




class CustomTokenizer:

    def __init__(self, segmentation_dictionary, bpe_tokenizer, max_length=512, pad_to_multiple_of=None):
        
        self.bpe_tokenizer = bpe_tokenizer
        self.bpe_vocab = self.bpe_tokenizer.get_vocab()

        self.segmentations = {word: seg for word, seg in segmentation_dictionary.items() if len(seg) > 0}
        self.seg_dict = {}
        for word, segs in self.segmentations.items():
            out = []
            for i, seg in enumerate(segs):
                if i == 0:
                    out.append('Ġ' + seg)
                else:
                    out.append(seg)
            self.seg_dict[word] = out
        
        self.segments = {seg for segs in self.seg_dict.values() for seg in segs}
        # self.seg_vocab = {seg: (i + len(self.bpe_vocab)) for i, seg in enumerate(self.segments) if not seg in self.bpe_vocab}
        
        # self.vocab = self.bpe_vocab | self.seg_vocab

        self.vocab = self.bpe_vocab.copy()
        
        for element in self.segments:
            if element not in self.vocab:
                self.vocab[element] = len(self.vocab) + 1

        self.mask_token = "<mask>"
        self.mask_token_id = self.vocab['<mask>']
        self.vocab[self.mask_token_id] = self.vocab['<mask>']
        
        self.cls_token = "<s>"
        self.cls_token_id = self.vocab['<s>']
        self.vocab[self.cls_token] = self.vocab['<s>']
        
        self.sep_token = "</s>"
        self.sep_token_id = self.vocab['</s>']
        self.vocab[self.sep_token] = self.vocab['</s>']
        
        self.pad_token = '<pad>'
        self.pad_token_id = self.vocab['<pad>']
        self.vocab[self.pad_token] = self.vocab['<pad>']
        


        # self.unk_token = '<|endoftext|>'
        # self.unk_token_id = self.vocab['<|endoftext|>']
        # self.vocab[self.unk_token] = self.vocab['<|endoftext|>']
        
        self.vocab['<unk>'] = len(self.vocab) + 1 
        self.unk_token = '<unk>'
        self.unk_token_id = self.vocab['<unk>']
        self.vocab[self.unk_token] = self.vocab['<unk>']
        
        # self.bos_token = '<|endoftext|>'
        # self.bos_token_id = self.vocab['<|endoftext|>']
        # self.vocab[self.bos_token] = self.vocab['<|endoftext|>']

        self.bos_token = '<s>'
        self.bos_token_id = self.vocab['<s>']
        self.vocab[self.bos_token] = self.vocab['<s>']
        
        self.eos_token = "</s>"
        self.eos_token_id = self.vocab["</s>"]
        self.vocab[self.eos_token] = self.vocab["</s>"]
        
        self.special_tokens = [self.vocab['<mask>'], self.vocab['<s>'], self.vocab['</s>'], self.vocab['<pad>'], self.vocab['<unk>']]
   
        
        self.vocab_size = len(self.vocab)

        self.inverted_vocab = {value: key for key, value in self.vocab.items()}

        self.max_length = max_length
        self.pad_to_multiple_of = pad_to_multiple_of

        


    def get_vocab(self):
        return self.vocab


    def tokenize(self, seq):

        # Implement your token to ID conversion logic here
        text = self.bpe_tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(seq)
        words_a = [word for word in [word for word, offset in text]]
        
        words_b = [word.replace('Ġ', '') for word in [word for word, offset in text]]
        tokens = []
        for i, word_b in enumerate(words_b):
            if word_b in self.seg_dict:
                tokens += self.seg_dict[word_b]
            else:
                if words_a[i][0] == 'Ġ':
                    tokens += self.bpe_tokenizer.tokenize(' ' + word_b)
                else: 
                    tokens += self.bpe_tokenizer.tokenize(word_b)    
        return tokens
        
        
    def encode(self, seq):


        return [self.bos_token_id] + self.convert_tokens_to_ids(self.tokenize(seq)) + [self.eos_token_id]

        


    def _convert_token_to_id(self, token):
        if token in self.vocab:
            return self.vocab[token]
        else:
            return self.unk_token_id

    
    def convert_tokens_to_ids(self, tokens):
        if isinstance(tokens, list):
            return [self._convert_token_to_id(token) for token in tokens]
        return self._convert_token_to_id(tokens)


    def _tokenize(self, seq):
        text = self.bpe_tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(seq)
        words_a = [word for word in [word for word, offset in text]]
        
        words_b = [word.replace('Ġ', '') for word in [word for word, offset in text]]
        tokens = []
        for i, word_b in enumerate(words_b):
            if word_b in self.seg_dict:
                tokens += self.seg_dict[word_b]
            else:
                if words_a[i][0] == 'Ġ':
                    tokens += self.bpe_tokenizer.tokenize(' ' + word_b)
                else: 
                    tokens += self.bpe_tokenizer.tokenize(word_b)    
        return tokens


    def __call__(self, text, return_tensors=None, padding=False, truncation=False, max_length=None, add_special_tokens=False):
        if isinstance(text, list):
            self.batch_encode_plus(text, truncation=truncation, return_tensors=return_tensors, padding=padding, max_length=max_length, add_special_tokens=add_special_tokens)
        else:

            token_ids = self.encode(text) 

            if truncation and max_length:
                token_ids = [self.bos_token_id] + token_ids[:max_length - 2] + [self.eos_token_id]
            if padding and max_length is not None:
                token_ids = token_ids + [self.pad_token_id] * (max_length - len(token_ids))
            if return_tensors == "pt":
                return {"input_ids": torch.tensor(token_ids, dtype=torch.long)}
            return {"input_ids": token_ids}


    def _convert_id_to_token(self, id):
        # Implement your ID to token conversion logic here
        return self.inverted_vocab[id]

    def convert_ids_to_tokens(self, ids):
        # Convert IDs back to tokens
        if isinstance(ids, list):
            return [self._convert_id_to_token(id) for id in ids]
        return self._convert_id_to_token(ids)
    
    def decode(self, ids):
        out = ''
        for id in ids:
            word = self._convert_id_to_token(id)
            if word[0] == 'Ġ':
                out += word.replace('Ġ', ' ')
            else:
                out += word
        return out
        

    def get_special_tokens_mask(self, token_ids, already_has_special_tokens=False):
        # Create a mask for special tokens
        return [1 if self._is_special_token(token_id) else 0 for token_id in token_ids]


    def _is_special_token(self, token_id):
        # Implement your logic to check if a token is a special token
        if token_id in self.special_tokens:
            return True
        else:
            return False


    def batch_encode_plus(self, texts, return_tensors=None, padding=False, truncation=False, max_length=None, add_special_tokens=False):
        batch_token_ids = [self.__call__(text, truncation=truncation, max_length=max_length, add_special_tokens=add_special_tokens)["input_ids"] for text in texts]
        
        if padding:
            if max_length is None:
                max_length = max(len(ids) for ids in batch_token_ids)
            batch_token_ids = [ids + [self.pad_token_id] * (max_length - len(ids)) for ids in batch_token_ids]
        
        if return_tensors == "pt":
            return {"input_ids": torch.tensor(batch_token_ids, dtype=torch.long)}
        return {"input_ids": batch_token_ids}


    def pad(self, batch, return_tensors="pt", pad_to_multiple_of=None):
        if pad_to_multiple_of is None:
            pad_to_multiple_of = self.pad_to_multiple_of


        input_ids_list = []
        for dictionary in batch:
            for key, value in dictionary.items():
                if isinstance(value, torch.Tensor):
                    input_ids_list.append(value.tolist())


        max_length = self.max_length or max(len(x) for x in input_ids_list)
        
        if pad_to_multiple_of is not None:
            max_length = (max_length + pad_to_multiple_of - 1) // pad_to_multiple_of * pad_to_multiple_of
        
        padded_batch = []
        for seq in input_ids_list:
            if len(seq) < max_length:
                seq.extend([self.pad_token_id] * (max_length - len(seq)))
            padded_batch.append(seq)
        
        attention_list = []
        for inner_list in padded_batch:
            p_list = [1 if value < self.vocab_size else 0 for value in inner_list]
            attention_list.append(p_list)
        
        if return_tensors == "pt":
            return {'input_ids': torch.tensor(padded_batch, dtype=torch.long), 'attention_mask': torch.tensor(attention_list, dtype=torch.long)}
        
        return {'input_ids': padded_batch, 'attention_mask': attention_list}

    
    def __len__(self):
        return self.vocab_size

    
        





In [None]:
class CustomTokenizerWP:

    def __init__(self, segmentation_dictionary, wp_tokenizer, max_length=512, pad_to_multiple_of=None):
        
        self.wp_tokenizer = wp_tokenizer
        self.wp_vocab = self.wp_tokenizer.get_vocab()

        self.segmentations = {word: seg for word, seg in segmentation_dictionary.items() if len(seg) > 0}
        self.seg_dict = {}
        for word, segs in self.segmentations.items():
            out = []
            for i, seg in enumerate(segs):
                if i == 0:
                    out.append(seg)
                else:
                    out.append('##' + seg)
            self.seg_dict[word] = out
        
        self.segments = {seg for segs in self.seg_dict.values() for seg in segs}
        # self.seg_vocab = {seg: (i + len(self.bpe_vocab)) for i, seg in enumerate(self.segments) if not seg in self.bpe_vocab}
        
        # self.vocab = self.bpe_vocab | self.seg_vocab

        self.vocab = self.wp_vocab.copy()
        
        for element in self.segments:
            if element not in self.vocab:
                self.vocab[element] = len(self.vocab) + 1
        

        self.l = len(self.vocab)

        self.vocab['<unk>'] = self.l + 1 
        self.unk_token = '<unk>'
        self.unk_token_id = self.vocab['<unk>']
        self.vocab[self.unk_token] = self.vocab['<unk>']

        self.vocab['<mask>'] = self.l + 2 
        self.mask_token = "<mask>"
        self.mask_token_id = self.vocab['<mask>']
        self.vocab[self.mask_token_id] = self.vocab['<mask>']
        
        self.vocab['<s>'] = self.l + 3
        self.cls_token = "<s>"
        self.cls_token_id = self.vocab['<s>']
        self.vocab[self.cls_token] = self.vocab['<s>']
        
        self.vocab['</s>'] = self.l + 4
        self.sep_token = "</s>"
        self.sep_token_id = self.vocab['</s>']
        self.vocab[self.sep_token] = self.vocab['</s>']
        
        self.vocab['<pad>'] = self.l + 5
        self.pad_token = '<pad>'
        self.pad_token_id = self.vocab['<pad>']
        self.vocab[self.pad_token] = self.vocab['<pad>']
        
        self.vocab['<s>'] = self.l + 6
        self.bos_token = '<s>'
        self.bos_token_id = self.vocab['<s>']
        self.vocab[self.bos_token] = self.vocab['<s>']
        
        self.vocab['</s>'] = self.l + 7 
        self.eos_token = "</s>"
        self.eos_token_id = self.vocab["</s>"]
        self.vocab[self.eos_token] = self.vocab["</s>"]
        
        self.special_tokens = [self.vocab['<mask>'], self.vocab['<s>'], self.vocab['</s>'], self.vocab['<pad>'], self.vocab['<unk>']]
   
        
        self.vocab_size = len(self.vocab)

        self.inverted_vocab = {value: key for key, value in self.vocab.items()}

        self.max_length = max_length
        self.pad_to_multiple_of = pad_to_multiple_of

        


    def get_vocab(self):
        return self.vocab


    def tokenize(self, seq):


        text = self.wp_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(seq)
        words_a = [word for word in [word for word, offset in text]]
        
        words_b = [word.replace('##', '') for word in [word for word, offset in text]]
        tokens = []
        for i, word_b in enumerate(words_b):
            if word_b in self.seg_dict:
                tokens += self.seg_dict[word_b]
            else:
                if words_a[i][0] == '##':
                    tokens += self.wp_tokenizer.tokenize(word_b)
                else: 
                    tokens += self.wp_tokenizer.tokenize(' ' + word_b)    
        return tokens
        
        
    def encode(self, seq):


        return [self.bos_token_id] + self.convert_tokens_to_ids(self.tokenize(seq)) + [self.eos_token_id]

        


    def _convert_token_to_id(self, token):
        if token in self.vocab:
            return self.vocab[token]
        else:
            return self.unk_token_id

    
    def convert_tokens_to_ids(self, tokens):
        if isinstance(tokens, list):
            return [self._convert_token_to_id(token) for token in tokens]
        return self._convert_token_to_id(tokens)


    def _tokenize(self, seq):

        text = self.wp_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(seq)
        words_a = [word for word in [word for word, offset in text]]
        
        words_b = [word.replace('##', '') for word in [word for word, offset in text]]
        tokens = []
        for i, word_b in enumerate(words_b):
            if word_b in self.seg_dict:
                tokens += self.seg_dict[word_b]
            else:
                if words_a[i][0] == '##':
                    tokens += self.wp_tokenizer.tokenize(word_b)
                else: 
                    tokens += self.wp_tokenizer.tokenize(' ' + word_b)    
        return tokens


    def __call__(self, text, return_tensors=None, padding=False, truncation=False, max_length=None, add_special_tokens=False):
        if isinstance(text, list):
            self.batch_encode_plus(text, truncation=truncation, return_tensors=return_tensors, padding=padding, max_length=max_length, add_special_tokens=add_special_tokens)
        else:

            token_ids = self.encode(text) 

            if truncation and max_length:
                token_ids = [self.bos_token_id] + token_ids[:max_length - 2] + [self.eos_token_id]
            if padding and max_length is not None:
                token_ids = token_ids + [self.pad_token_id] * (max_length - len(token_ids))
            if return_tensors == "pt":
                return {"input_ids": torch.tensor(token_ids, dtype=torch.long)}
            return {"input_ids": token_ids}


    def _convert_id_to_token(self, id):
        # Implement your ID to token conversion logic here
        return self.inverted_vocab[id]

    def convert_ids_to_tokens(self, ids):
        # Convert IDs back to tokens
        if isinstance(ids, list):
            return [self._convert_id_to_token(id) for id in ids]
        return self._convert_id_to_token(ids)
    
    def decode(self, ids):
        out = ''
        for id in ids:
            word = self._convert_id_to_token(id)
            if word[:2] == '##':
                out += word.replace('##', '')
            else:
                out += ' ' + word
        return out
        

    def get_special_tokens_mask(self, token_ids, already_has_special_tokens=False):
        # Create a mask for special tokens
        return [1 if self._is_special_token(token_id) else 0 for token_id in token_ids]


    def _is_special_token(self, token_id):
        # Implement your logic to check if a token is a special token
        if token_id in self.special_tokens:
            return True
        else:
            return False


    def batch_encode_plus(self, texts, return_tensors=None, padding=False, truncation=False, max_length=None, add_special_tokens=False):
        batch_token_ids = [self.__call__(text, truncation=truncation, max_length=max_length, add_special_tokens=add_special_tokens)["input_ids"] for text in texts]
        
        if padding:
            if max_length is None:
                max_length = max(len(ids) for ids in batch_token_ids)
            batch_token_ids = [ids + [self.pad_token_id] * (max_length - len(ids)) for ids in batch_token_ids]
        
        if return_tensors == "pt":
            return {"input_ids": torch.tensor(batch_token_ids, dtype=torch.long)}
        return {"input_ids": batch_token_ids}


    def pad(self, batch, return_tensors="pt", pad_to_multiple_of=None):
        if pad_to_multiple_of is None:
            pad_to_multiple_of = self.pad_to_multiple_of


        input_ids_list = []
        for dictionary in batch:
            for key, value in dictionary.items():
                if isinstance(value, torch.Tensor):
                    input_ids_list.append(value.tolist())


        max_length = self.max_length or max(len(x) for x in input_ids_list)
        
        if pad_to_multiple_of is not None:
            max_length = (max_length + pad_to_multiple_of - 1) // pad_to_multiple_of * pad_to_multiple_of
        
        padded_batch = []
        for seq in input_ids_list:
            if len(seq) < max_length:
                seq.extend([self.pad_token_id] * (max_length - len(seq)))
            padded_batch.append(seq)
        
        attention_list = []
        for inner_list in padded_batch:
            p_list = [1 if value < self.vocab_size else 0 for value in inner_list]
            attention_list.append(p_list)
        
        if return_tensors == "pt":
            return {'input_ids': torch.tensor(padded_batch, dtype=torch.long), 'attention_mask': torch.tensor(attention_list, dtype=torch.long)}
        
        return {'input_ids': padded_batch, 'attention_mask': attention_list}

    
    def __len__(self):
        return self.vocab_size

In [846]:

# import torch

# class CustomTokenizer:

#     def __init__(self, segmentation_dictionary, bpe_tokenizer):
        
#         self.bpe_tokenizer = bpe_tokenizer
#         self.bpe_vocab = self.bpe_tokenizer.get_vocab()

#         self.segmentations = {word: seg for word, seg in segmentation_dictionary.items() if len(seg) > 0}
#         self.seg_dict = {}
#         for word, segs in self.segmentations.items():
#             out = []
#             for i, seg in enumerate(segs):
#                 if i == 0:
#                     out.append('Ġ' + seg)
#                 else:
#                     out.append(seg)
#             self.seg_dict[word] = out
        
#         self.segments = {seg for segs in self.seg_dict.values() for seg in segs}
#         self.seg_vocab = {seg: (i + len(self.bpe_vocab)) for i, seg in enumerate(self.segments) if not seg in self.bpe_vocab}
        
#         self.vocab = self.bpe_vocab | self.seg_vocab
    
#         self.vs = len(self.vocab)
#         self.mask_token = "<mask>"
#         self.mask_token_id = self.vs + 1
#         self.vocab[self.mask_token_id] = self.vs + 1
#         self.cls_token = "<s>"
#         self.cls_token_id = self.vs + 2
#         self.vocab[self.cls_token] = self.vs + 2
#         self.sep_token = "</s>"
#         self.sep_token_id = self.vs + 3
#         self.vocab[self.sep_token] = self.vs + 3
#         self.pad_token = '<pad>'
#         self.pad_token_id = self.vs + 4
#         self.vocab[self.pad_token] = self.vs + 4
#         #self.unk_token = '<unk>'
#         self.unk_token = '<|endoftext|>'
#         self.unk_token_id = self.vs + 5
#         self.vocab[self.unk_token] = self.vs + 5
#         #self.bos_token = '<s>'
#         self.bos_token = '<|endoftext|>'
#         self.bos_token_id = self.vs + 6
#         self.vocab[self.bos_token] = self.vs + 6
#         #self.eos_token = '</s>'
#         self.eos_token = '<|endoftext|>'
#         self.eos_token_id = self.vs + 7
#         self.vocab[self.eos_token] = self.vs + 7
        
#         self.vocab_size = len(self.vocab)

#         self.inverted_vocab = {value: key for key, value in self.vocab.items()}



#     def get_vocab(self):
#         return self.vocab

    
#     def tokenize(self, seq):

#         # Implement your token to ID conversion logic here
#         text = self.bpe_tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(seq)
#         words_a = [word for word in [word for word, offset in text]]
        
#         words_b = [word.replace('Ġ', '') for word in [word for word, offset in text]]
#         tokens = []
#         for i, word_b in enumerate(words_b):
#             if word_b in self.seg_dict:
#                 tokens += self.seg_dict[word_b]
#             else:
#                 if words_a[i][0] == 'Ġ':
#                     tokens += self.bpe_tokenizer.tokenize(' ' + word_b)
#                 else: 
#                     tokens += self.bpe_tokenizer.tokenize(word_b)    
#         return tokens
        
#     def encode(self, seq):
#         return self.convert_tokens_to_ids(self.tokenize(seq))


#     def _convert_token_to_id(self, token):
#         if token in self.vocab:
#             return self.vocab[token]
#         else:
#             return self.unk_token_id

    
#     def convert_tokens_to_ids(self, tokens):
#         if isinstance(tokens, list):
#             return [self._convert_token_to_id(token) for token in tokens]
#         return self._convert_token_to_id(tokens)


#     def _tokenize(self, seq):
#         text = self.bpe_tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(seq)
#         words_a = [word for word in [word for word, offset in text]]
        
#         words_b = [word.replace('Ġ', '') for word in [word for word, offset in text]]
#         tokens = []
#         for i, word_b in enumerate(words_b):
#             if word_b in self.seg_dict:
#                 tokens += self.seg_dict[word_b]
#             else:
#                 if words_a[i][0] == 'Ġ':
#                     tokens += self.bpe_tokenizer.tokenize(' ' + word_b)
#                 else: 
#                     tokens += self.bpe_tokenizer.tokenize(word_b)    
#         return tokens



#     def __call__(self, text, return_tensors=None, padding=False, truncation=False, max_length=None, add_special_tokens=False):
#         if isinstance(text, list):
#             self.batch_encode_plus(text, truncation=truncation, return_tensors=return_tensors, padding=padding, max_length=max_length, add_special_tokens=add_special_tokens)
#         else:

#             token_ids = self.encode(text)  
#             if add_special_tokens:
#                 token_ids = [self.bos_token_id] + token_ids + [self.eos_token_id]
#             if truncation and max_length is not None:
#                 token_ids = token_ids[:max_length]
#             if padding and max_length is not None:
#                 token_ids = token_ids + [self.pad_token_id] * (max_length - len(token_ids))
#             if return_tensors == "pt":
#                 return {"input_ids": torch.tensor(token_ids, dtype=torch.long)}
#             return {"input_ids": token_ids}






#     def _convert_id_to_token(self, id):
#         # Implement your ID to token conversion logic here
#         return self.inverted_vocab[id]

#     def convert_ids_to_tokens(self, ids):
#         # Convert IDs back to tokens
#         if isinstance(ids, list):
#             return [self._convert_id_to_token(id) for id in ids]
#         return self._convert_id_to_token(ids)
    
#     def decode(self, ids):
#         out = ''
#         for id in ids:
#             word = self._convert_id_to_token(id)
#             if word[0] == 'Ġ':
#                 out += word.replace('Ġ', ' ')
#             else:
#                 out += word
#         return out
        

#     def get_special_tokens_mask(self, token_ids, already_has_special_tokens=False):
#         # Create a mask for special tokens
#         return [1 if self._is_special_token(token_id) else 0 for token_id in token_ids]


#     def _is_special_token(self, token_id):
#         # Implement your logic to check if a token is a special token
#         if token_id in self.special_tokens:
#             return True
#         else:
#             return False



#     def batch_encode_plus(self, texts, return_tensors=None, padding=False, truncation=False, max_length=None, add_special_tokens=True):
#         batch_token_ids = [self.__call__(text, truncation=truncation, max_length=max_length, add_special_tokens=add_special_tokens)["input_ids"] for text in texts]
        
#         if padding:
#             if max_length is None:
#                 max_length = max(len(ids) for ids in batch_token_ids)
#             batch_token_ids = [ids + [self.pad_token_id] * (max_length - len(ids)) for ids in batch_token_ids]
        
#         if return_tensors == "pt":
#             return {"input_ids": torch.tensor(batch_token_ids, dtype=torch.long)}
#         return {"input_ids": batch_token_ids}




In [794]:
# %%time

# import os
# import torch
# from datasets import Dataset


# # pick tokenizer
# tokenizer = ff

# # pick txt file to transform
# text_file_path = '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short/OSCAR_shorter.txt'

# def tokenize_line(line):
#     # Specify the max_length parameter to ensure consistent padding
#     return {'input_ids': tokenizer(line.strip())['input_ids']}
#     #return tokenizer(line, truncation=False, max_length=512, return_tensors='pt', add_special_tokens=False)

# # Read the text file and tokenize each line
# with open(text_file_path, 'r', encoding='utf-8') as file:
#     lines = file.readlines()

# # Create a list of dictionaries with tokenized lines
# tokenized_lines = [tokenize_line(line) for line in lines]

# # Create a Hugging Face dataset from the list of dictionaries
# dataset = Dataset.from_list(tokenized_lines).with_format("torch")

# # save
# #dataset.save_to_disk('tokenized_dataset')

CPU times: user 1.18 s, sys: 29.3 ms, total: 1.21 s
Wall time: 1.3 s


In [801]:
# from transformers import DataCollatorForLanguageModeling

# data_collator = DataCollatorForLanguageModeling(
#     tokenizer=ff, mlm=True, mlm_probability=0.15
# )

# # klein model om te testen
# config = RobertaConfig(
#     vocab_size=len(tokenizer.get_vocab()),
#     max_position_embeddings=514,
#     num_attention_heads=6,
#     num_hidden_layers=3,
#     type_vocab_size=1,
# )


# model = RobertaForMaskedLM(config=config)

# print(f'number of parameters model: {model.num_parameters()}')

# from transformers import Trainer, TrainingArguments

# training_args = TrainingArguments(
#     output_dir="/Users/jan/Documents/Master/Thesis/Code/Models/model2", # note, this is not where the model is saved, just info about training
#     overwrite_output_dir=True,
#     num_train_epochs=1,
#     per_device_train_batch_size=64,
#     save_steps=10_000,
#     save_total_limit=1000,
#     prediction_loss_only=True,
# )

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     data_collator=data_collator,
#     train_dataset=dataset,
# )

# %%time
# trainer.train()

### Detokenizing

We must now find a way to convert the toke ids to text. 

In [478]:
# laten we fietsen nemen

x = CustomTokenizerMorphPiece(segmentations, tokenizer)

x.encode('fietsen')

words_b: ['fietsen']
out = [15022]


[15022]

In [479]:
x.decode([15022])

'fiets'

## Evaluation of own tokenizer

### Functions to evaluate how specific words are tokenized

In [311]:
def load_simlex(simlex999, scores=False):
    # create list with a tuple for every word pair in the form of (word_1, word_2, similarity score, POS-tag)
    word_pairs = []

    # create a set with all words
    words_set = set([])

    with open(simlex999) as simlex:
        
        next(simlex) # skip first line
        
        for line in simlex:
    
            split = line.strip().split('\t')
            word_pairs.append(tuple(split))
            words_set.add(split[0])
            words_set.add(split[1])

    # create a list of unique words
    simlex_words = list(words_set)

    if scores:
        return word_pairs
    else:
        return simlex_words

def simlex_celex(df, n, simlex_words, exclusive=False, with_segments=False):

    d = create_n_segmentations(df, 0)
    dic = {}

    for i in range(n + 2):
        data = create_n_segmentations(df, i)
        dic[i] = []
        for word in simlex_words:
            if word in data:
                dic[i].append(word)

    if exclusive:
        out = list(set(dic[n]) - set(dic[n + 1]))
    else:
        out = dic[n]
    
    if with_segments:
        out = {word: d[word] for word in out}
    
    print(f'There are {len(simlex_words)} unique words in the simlex dataset')
    
    if n != 1:
        if exclusive:
            print(f'There are {len(out)} words in simlex that have exactly {n} segments in CELEX')
        else:
            print(f'There are {len(out)} words in simlex that have at least {n} segments in CELEX')
    else:
        if exclusive:
            print(f'There are {len(out)} words in simlex that have exactly {n} segment in CELEX')
        else:
            print(f'There are {len(out)} words in simlex that have at least {n} segment in CELEX') 

    return out


def tokenizer_segmentations(words, tokenizer, only_splits=False):
    
    segs = {}

    for word in words:
        tokenization = tokenizer.tokenize(word)
        tokenization = [word.replace('Ġ', '') for word in tokenization]
        tokenization = [word.replace('##', '') for word in tokenization]
        segs[word] = tokenization
    
    if only_splits:
        return {word: seg for word, seg in segs.items() if len(seg) > 1}
    else:
        return segs


def compare_tokenizer_segmentations(words, *tokenizers):

    results = {}

    for tokenizer in tokenizers:
        results[tokenizer] = tokenizer_segmentations(words, tokenizer)
    
    words_same = []
    words_diff = []

    for word in words:
        comp = {}
        for tokenizer in results:
            comp[tokenizer] = results[tokenizer][word]
        iterator = iter(comp.values())
        first_value = next(iterator)
        if all(value == first_value for value in iterator):
            words_same.append(word)
        else:
            words_diff.append(word)
    
    diff = {}
    for word in words_diff:
        c = {}
        for i, tokenizer in enumerate(tokenizers):
            c[i+1] = results[tokenizer][word]
        diff[word] = c

    
    if len(tokenizers) == 2:
        print(f'{len(words_same)} words out of the {len(words)} are tokenized in the same way by both tokenizers')
    else:
        print(f'{len(words_same)} words out of the {len(words)} are tokenized in the same way by all {len(tokenizers)} tokenizers')

    return words_same, words_diff, diff




def compare_tokenizer_with_celex(df, tokenizer):

    words = [word for word in df]
    segs = tokenizer_segmentations(words, tokenizer)

    words_same = []
    words_diff = []
    diff = {}

    for word in df:
        if df[word] == segs[word]:
            words_same.append(word)
        else:
            words_diff.append(word)
    
    diff = {}
    for word in words_diff:
        if len(segs[word]) >= 2:
            c = {}
            c['CELEX'] = df[word]
            c['tokenizer'] = segs[word]
            diff[word] = c
    
    return words_same, words_diff, diff

def return_words_not_in_dict(df, words):
    out = []
    for word in words:
        if word not in df:
            out.append(word)
    return out



### Functions to evaluate speed

In [2143]:
import time

# function to create a dataset with text 
def create_test_set(dataset_generator, start, end):
    it = iter(dataset_generator)
    for _ in range(start):
        next(it)
    for _ in range(end - start + 1):
        yield next(it)


# function to measure time a function takes
def measure_execution_time(function, *args, **kwargs):

    start_time = time.time()
    function(*args, **kwargs)
    end_time = time.time()
    return end_time - start_time

# function to measure how much time it takes for a function to process data
# the function here should not take a generator as input
# the data is supplied by a generator
def measure_time_normal_function_x(data_generator, start, end, function, *args, **kwargs):

    gen = create_test_set(data_generator, start, end)
    t=0

    for i in gen:
        text = i['text']
        t += measure_execution_time(function, text, *args, **kwargs)
    
    return t

# this function takes a generator and function as input, and measures the time it takes to execute the function for every item in the generator
def measure_time_normal_function(data_generator, start, end, function, *args, **kwargs):

    gen = create_test_set(data_generator, start, end)
    t=0

    for i in gen:
        t += measure_execution_time(function, i, *args, **kwargs)
    
    return t


# function to measure how much time it takes a function to execute for every item in the generator
# same as the function above, but this one is for functions that take a generator as argument
def measure_time_generator_function(data_generator, start, end, function, *args, **kwargs):
    
    gen = create_test_set(data_generator, start, end)
    return measure_execution_time(function, gen, *args, **kwargs)



# function to use for the generators that yield dictionaries with 'text' as key
# this is for the iterable dictionary provided by hugging face
# (We now use a wrapper generator, so probably don't need this anymore)
def measure_time_iterable_text_dict(data_generator, start, end, function, *args, **kwargs):
    
    data_generator = create_text_generator(data_generator)
    gen = create_test_set(data_generator, start, end)
    return measure_execution_time(function, gen, *args, **kwargs)



# function to turn a generator that returns a dictionary with 'text' as key into a generator of the values
def create_text_generator(gen):
    for i in gen:
        yield i['text']




### Actual evaluation and comparisons

In order to evaluate the tokenizers, we need the following elements:
- the simlex words
- a corpus in the form of a generator
- a corpus in the form of a frequency dictionary

We will create various different versions of these three elements


Let's load the tokenizers we have trained externally

In [34]:
from transformers import AutoTokenizer
from transformers import RobertaTokenizerFast
# maakt het uit of we autotokenizer of tokenizerfast gebruiken?


t1 = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
t2 = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2023-dutch-large")


# WordPiece tokenizers
wp_30 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeWP/t30", max_len=512)
wp_40 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeWP/t40", max_len=512)
wp_50 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeWP/t50", max_len=512)

# BPE tokenizer
bpe_30 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeBPE/t30", max_len=512)
bpe_40 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeBPE/t40", max_len=512)
bpe_50 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeBPE/t50", max_len=512)




The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BertTokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class

In [35]:
t30 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeXX/t30", max_len=512)
t32 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeXX/t32", max_len=512)
t35 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeXX/t35", max_len=512)
t40 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeXX/t40", max_len=512)
t45 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeXX/t45", max_len=512)
t50 = RobertaTokenizerFast.from_pretrained("/Users/jan/Documents/Master/Thesis/Code/Tokenizers/tokenizeXX/t50", max_len=512)

# own custom tokenizers
own_30 = CustomTokenizer(segmentation_dictionary, t30)
own_32 = CustomTokenizer(segmentation_dictionary, t32)
own_35 = CustomTokenizer(segmentation_dictionary, t35)
own_40 = CustomTokenizer(segmentation_dictionary, t40)
own_45 = CustomTokenizer(segmentation_dictionary, t45)
own_50 = CustomTokenizer(segmentation_dictionary, t50)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'RobertaTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class

In [37]:
len(own_50)

41752

In [36]:
print(wp_50.tokenize('ik ga lopen'))
print(bpe_50.tokenize('ik ga lopen'))
print(own_50.tokenize('ik ga lopen'))

['ik', 'ga', 'lopen']
['ik', 'Ġga', 'Ġlopen']
['Ġik', 'Ġga', 'Ġl', 'open']


In [828]:
a, b, c = compare_tokenizer_segmentations(simlex_words, wp_50, bpe_50)

#c


677 words out of the 1045 are tokenized in the same way by both tokenizers


In [832]:
def count_splits(words, tokenizer):

    toks = tokenizer_segmentations(words, tokenizer)

    split = {}
    no_split = []
    for word, segs in toks.items():
        if len(segs) > 1:
            split[word] = segs
        else:
            no_split.append(word)
    
    print(f'Number of words split up: {len(split)}')
    print(f'Number of words not split up: {len(no_split)}')

    return split, no_split


In [840]:
r, t = count_splits(simlex_words, wp_50)

Number of words split up: 161
Number of words not split up: 884


In [841]:
r, t = count_splits(simlex_words, bpe_50)

Number of words split up: 441
Number of words not split up: 604


In [844]:
r, t = count_splits(simlex_words, own_50)

Number of words split up: 390
Number of words not split up: 655


In [853]:
len(own_50.get_vocab())

46398

In [38]:
for word in simlex_words:
    print(word, own_50.tokenize(word))

prachtig ['Ġpracht', 'ig']
rekenkunde ['Ġreken', 'kunde']
doel ['Ġdoel']
gangpad ['Ġgang', 'pad']
scheikunde ['sche', 'ik', 'unde']
tabak ['Ġtabak']
situatie ['situatie']
winnen ['w', 'innen']
blik ['Ġblik']
boom ['Ġboom']
baksteen ['Ġbak', 'steen']
agressie ['Ġagressie']
lezen ['lezen']
kapitein ['Ġkapitein']
vers ['Ġvers']
regio ['Ġregio']
ballade ['Ġballade']
uitbeelden ['Ġuit', 'beeld', 'en']
kraampje ['kra', 'amp', 'je']
telefoon ['Ġtelefoon']
melodie ['mel', 'od', 'ie']
prins ['Ġprins']
koor ['Ġkoor']
discussie ['Ġdiscussie']
monster ['Ġmonster']
wortel ['Ġwortel']
zelf ['Ġzelf']
schildwacht ['Ġschild', 'wacht']
macht ['Ġmacht']
vrolijk ['Ġvrolijk']
motor ['Ġmotor']
diamant ['Ġdiamant']
theorie ['Ġtheorie']
misdaad ['Ġmis', 'daad']
emotie ['Ġemotie']
zeker ['Ġzeker']
mantel ['Ġmantel']
aanzien ['Ġaanzien']
psychologie ['psych', 'ologie']
bijwonen ['bij', 'wonen']
doos ['Ġdoos']
insect ['inse', 'ct']
doen alsof ['Ġdoen', 'Ġalsof']
toevoegen ['Ġtoe', 'voeg', 'en']
ongeduldig ['Ġon'

## Dingen opslaan

In [2277]:
import json 

with open('word_freqs_all.json', 'w') as f:
    json.dump(word_freqs_all, f)

with open('word_freqs_all_lower.json', 'w') as f:
    json.dump(word_freqs_lower_all, f)


## Kladblok

In [1890]:
3803 + 10 + 11 + 13 + 32 + 30 + 25 - 2500 + 15 + 250

1689

In [1753]:
from tqdm import tqdm
import time

def slow_function(n):
    result = 0
    for i in tqdm(range(n), desc="Processing", unit=" iterations", leave=True):
        time.sleep(0.1)  # Simulating a time-consuming task
        result += i
    return result


def test_fun(x):
    z = 0
    v = 0
    for c, i in enumerate(tqdm(range(x*999999), desc=f"Executing function. Progress", unit=" iterations", leave=True)):
        z += i
        v += c
        if c == 0:
            print('bijna')

    return z, v
    


In [1754]:
a, b = test_fun(10)

Executing function. Progress:   8%|▊         | 751612/9999990 [00:00<00:02, 3833703.04 iterations/s]

bijna


Executing function. Progress: 100%|██████████| 9999990/9999990 [00:02<00:00, 4439866.14 iterations/s]


In [1807]:
from tqdm import tqdm

def number_generator():
    for i in range(10000000):
        yield i

def fun(x):
    s = 0
    # Example usage with tqdm for progress bar
    for number in tqdm(x, total=100000000, desc="Generating numbers", unit=" number"):
        s += number*number - number
    return s

In [1808]:
x = number_generator()
fun(x)

Generating numbers:  10%|█         | 10000000/100000000 [00:02<00:21, 4175792.71 number/s]


333333233333340000000

In [1714]:
slow_function(100)

Processing: 100%|██████████| 100/100 [00:10<00:00,  9.49 iterations/s]


4950

In [1732]:
test_fun(100)

Executing function. Progress: 100%|██████████| 99999900/99999900 [00:16<00:00, 5971319.45 iterations/s]


4999989950005050

In [1726]:
test_fun(100)

Progressss: 100%|██████████| 99999900/99999900 [00:20<00:00, 4828428.92 iterations/s]


989999010000

In [1699]:
b = {'cat': 'N', 'segments': ['be', 'houd']}

'a' in b

False

In [741]:
words['behoud']

{'cat': 'N', 'segments': ['be', 'houd']}

In [262]:
ids = tokenizer.encode('dit is een test met ongberuikelijkbare woorden')
ids

[2283, 3586, 940, 6589, 2207, 8304, 1512, 5563, 363, 809, 2578]

In [1389]:
x = {1: 10, 2: 20, 3: 30}
y = {3: 40, 4: 50}

x | y 

{1: 10, 2: 20, 3: 40, 4: 50}

In [1390]:
x = {1: 10, 2: 20, 3: 30}
y = {3: 40, 4: 50}

y | x

{3: 30, 4: 50, 1: 10, 2: 20}

In [1394]:
def ddd(a, b,
                  c):
    print(a)

ddd(3, 4, 5)


3


In [1397]:
print(f'''There are 3 entries in the database. Out of these:
- 5 words have no segmentations
        - 6 words have a single morpheme as segmentation 
        - 7 words are split up into multiple morphemes''')

There are 3 entries in the database. Out of these:
- 5 words have no segmentations
        - 6 words have a single morpheme as segmentation 
        - 7 words are split up into multiple morphemes


In [1600]:


# this function adds the word as segmentation of itself for all words that have an empty list as segmentation
def add_empty_segmentations(df):

    out = {}

    for word, seg in df.items():
        if len(seg) == 0:
            out[word] = [word]
        else:
            out[word] = seg

    return out

In [1601]:
a = {'a': [], 'b': ['r']}

b = add_empty_segmentations(a)

b

{'a': ['a'], 'b': ['r']}

In [1596]:
a = {'a': [], 'b': ['r']}

a[4] = 5

a

{'a': [], 'b': ['r'], 4: 5}

In [1742]:


with open(oscar1) as osc:
    n = 0
    for line in osc:
        n+= 1
    print(n)


6702288


In [1743]:
oscar2 = os.path.join(data_path, 'OSCAR', 'nl_part_2.txt')



with open(oscar2) as osc:
    n = 0
    for line in osc:
        n+= 1
    print(n)

6803845


In [1793]:
def create_word_freqs_from_online_corpus(corpus_generator, sorted=False, progress=True, avg_size=6750000, n_files=45):

    size = n_files * avg_size
    
    if progress:
        print(f'The estimated size of the entire corpus is around {format_with_dots(size)} lines of text!')
        
        word_freqs = {}

        for i in range(1):
            if i == 0:
                print(f'Generating the frequency dictionary ...\n')

        for i in tqdm(corpus_generator, total=size, desc="Progress", unit=" iterations"):     
            text = preprocess_basic(i['text'])
            for word in text:
                if word in word_freqs:
                    word_freqs[word] += 1
                else:
                    word_freqs[word] = 1
    
    return word_freqs
    











# this function creates a word frequency dictionary for a corpus, so that we can count faster later on
def create_word_freqs_from_local_corpus(corpus_generator, sorted=False, progress=True, path=0):
    
    if progress:
        print(f'Calculating the size of the dataset ...\n')
        if path == 0:
            assert path == 1, 'Enter path to get progress bar, or set progress=False to perform the function without one'
        else:
            size = get_size_for_local(path)
    
        word_freqs = {}
        

        for i in range(1):
            if i == 0:
                print(f'Data size: {format_with_dots(size)} lines of text! Generating the frequency dictionary ...\n')

        for i in tqdm(corpus_generator, total=size, desc="Progress", unit=" iterations"):     
            text = preprocess_basic(i['text'])
            for word in text:
                if word in word_freqs:
                    word_freqs[word] += 1
                else:
                    word_freqs[word] = 1
    
    else:

        print('Performing task without progress bar')

        word_freqs = {}

        for i in corpus_generator:     
            text = preprocess_basic(i['text'])
            for word in text:
                if word in word_freqs:
                    word_freqs[word] += 1
                else:
                    word_freqs[word] = 1
        
    
    if sorted:
        return dict(sorted(word_freqs.items(), key=lambda item: item[1], reverse=True))
    
    return word_freqs



def get_size_for_local(path):
    with open(path) as osc:
        n = 0
        for line in osc:
            n+= 1
        return n 

def format_with_dots(number):
    return f"{number:,}".replace(",", ".")


In [1789]:
data = dataset_from_hub

In [2164]:
seq = 'wienfoiwef weff vervfe verv PP'

seq.strip().lower().split()

['wienfoiwef', 'weff', 'vervfe', 'verv', 'pp']

## Jobs


#### Maak frequency dict

Stap 1: maak data stream

In [2212]:
from datasets import load_dataset, DatasetDict
import os


# function that returns a dictionary with a generator for every existing OSCAR file in this computer
def create_local_oscar_generators(data_path, i=0, j=0):

    out = {}
    
    if j > i:
        n = j - i

        for x in range(i, j+1):
            full_path = os.path.join(data_path, 'OSCAR', f'nl_part_{x}.txt')
            if os.path.isfile(full_path):
                out[f'oscar{i}'] = create_text_generator(load_dataset('text', data_files={"train": full_path}, split='train', streaming=True))
        
        if len(out) != n + 1:
            print('Not all parts requested are on this computer')
    
    else:

        for i in range(1, 50):
            full_path = os.path.join(data_path, 'OSCAR', f'nl_part_{i}.txt')
            if os.path.isfile(full_path):
                out[f'oscar{i}'] = create_text_generator(load_dataset('text', data_files={"train": full_path}, split='train', streaming=True))


    return out


# function that creates one generator out of multiple generators
def create_super_generator(generator_dict, list_input=False):

    if list_input:
        for generator in generator_dict:
            yield from generator
    else:
        for generator in generator_dict.values():
            yield from generator


# one function to create OSCAR generator by combining n parts of the dataset, from part i to part j
def create_super_local_oscar_generator(data_path, i=0, j=0):
    
    if j > i:
        generators = create_local_oscar_generators(data_path, i=i, j=j)
    else:
        generators = create_local_oscar_generators(data_path)

    return create_super_generator(generators)


# function to create a dataset with text 
def create_test_set(dataset_generator, start, end):
    it = iter(dataset_generator)
    for _ in range(start):
        next(it)
    for _ in range(end - start + 1):
        yield next(it)


# function to turn a generator that returns a dictionary with 'text' as key into a generator of the values
def create_text_generator(gen):
    for i in gen:
        yield i['text']








# # set path to datasets
# data_path = '/Users/jan/Documents/Master Information Studies/Thesis/Code/Datasets'

# # create super generator from all OSCAR files on computer
# oscar_gen_super = create_super_local_oscar_generator(data_path)

# # create small dataset (uneven number of lines)
# oscar_gen_small = create_test_set(oscar_gen_1, 0, 100007)

In [111]:
data_path = '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR'

import os

def get_all_file_paths(folder_path):
    file_paths = []
    for root, directories, files in os.walk(folder_path):
        # Filter out directories that start with a dot
        directories[:] = [d for d in directories if not d.startswith('.')]
        for file in files:
            # Filter out files that start with a dot
            if not file.startswith('.'):
                file_path = os.path.join(root, file)
                if os.path.isfile(file_path):  # Check if the path is a file
                    file_paths.append(file_path)
    return file_paths


paths = get_all_file_paths(data_path)



In [117]:


p = '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short/OSCAR_short.txt'
d = load_dataset('text', data_files={"train": p}, split='train')

In [120]:
def word_freqs_multiple_paths(paths):
    word_freqs = {}
    for path in paths:
        dataset = load_dataset('text', data_files={"train": path}, split='train')
        for i in dataset:
            for word in preprocess_lower(i['text']):
                if word in word_freqs:
                    word_freqs[word] += 1
                else:
                    word_freqs[word] = 1
    return word_freqs
        

In [121]:
word_freqs_multiple_paths(paths)

KeyboardInterrupt: 

In [2268]:
# deze code is gebruikt in freqs1.job (freqs1.py)


import string
import os
import json
from datasets import load_dataset, DatasetDict

path = '/home/scur2141/datasets'



def preprocess_lower(seq):
    return [s.strip(string.punctuation) for s in seq.strip().lower().split()]


def create_data_gen(path):

    iterators = []

    for i in range(1, 50):
        full_path = os.path.join(path, f'nl_part_{i}.txt')
        if os.path.isfile(full_path):
            print('ja')
            iterators.append(load_dataset('text', data_files={"train": full_path}, split='train', streaming=True))
    
    for it in iterators:
        yield from it
        
    
def create_word_freqs_from_corpus(corpus_generator, sort=False):
    
    word_freqs = {}

    for i in corpus_generator:     
        text = preprocess_lower(i['text'])
        for word in text:
            if word in word_freqs:
                word_freqs[word] += 1
            else:
                word_freqs[word] = 1
    
    if sort:
        word_freqs = dict(sorted(word_freqs.items(), key=lambda item: item[1], reverse=True))
    
    return word_freqs


data_it = create_data_gen(path)
freqs = create_word_freqs_from_corpus(data_it, sort=True)


with open('frequencies.json', 'w') as f:
    json.dump(freqs, f)



In [2238]:
x_dict = {3:4, 5: 6}

with open('test_dict.json', 'w') as f:
    json.dump(x_dict, f)



In [2239]:
with open('test_dict.json', 'r') as f:
    my_dict = json.load(f)

In [2240]:
my_dict

{'3': 4, '5': 6}

In [2255]:
from datasets import load_dataset, DatasetDict
import os


# function that returns a dictionary with a generator for every existing OSCAR file in this computer
def create_local_oscar_generators(path, i=0, j=0):

    out = {}
    
    if j > i:
        n = j - i

        for x in range(i, j+1):
            full_path = os.path.join(path, f'nl_part_{x}.txt')
            if os.path.isfile(full_path):
                out[f'oscar{i}'] = create_text_generator(load_dataset('text', data_files={"train": full_path}, split='train', streaming=True))
        
        if len(out) != n + 1:
            print('Not all parts requested are on this computer')
    
    else:

        for i in range(1, 50):
            full_path = os.path.join(path, f'nl_part_{i}.txt')
            print(full_path)
            if os.path.isfile(full_path):
                print('ja')
                out[f'oscar{i}'] = create_text_generator(load_dataset('text', data_files={"train": full_path}, split='train', streaming=True))

    return out


# function that creates one generator out of multiple generators
def create_super_generator(generator_dict, list_input=False):

    if list_input:
        for generator in generator_dict:
            yield from generator
    else:
        for generator in generator_dict.values():
            yield from generator


# one function to create OSCAR generator by combining n parts of the dataset, from part i to part j
def create_super_local_oscar_generator(data_path, i=0, j=0):
    
    if j > i:
        generators = create_local_oscar_generators(data_path, i=i, j=j)
    else:
        generators = create_local_oscar_generators(data_path)

    return create_super_generator(generators)


# function to create a dataset with text 
def create_test_set(dataset_generator, start, end):
    it = iter(dataset_generator)
    for _ in range(start):
        next(it)
    for _ in range(end - start + 1):
        yield next(it)


# function to turn a generator that returns a dictionary with 'text' as key into a generator of the values
def create_text_generator(gen):
    for i in gen:
        yield i['text']


In [2242]:
path = 'Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR'

x = create_super_generator(create_local_oscar_generators(path))




In [2243]:
for i, line in enumerate(x):
    if i < 6:
        print(line)

In [2257]:
path = 'Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR'
x = create_local_oscar_generators(path)

os.path.isfile('Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_1.txt')


Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_1.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_2.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_3.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_4.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_5.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_6.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_7.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_8.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_9.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_10.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_11.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_12.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_13.txt
Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_14.txt
Users/jan/Documents/Master/Thesis/Code/Data

False

In [2254]:
x


{}

In [2269]:
path = 'Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR'
y = create_data_gen(path)

In [2270]:
next(y)


StopIteration: 

In [2261]:
os.path.isfile('Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_1.txt')



False

In [105]:
path = 'Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR'

paths = []
for x in range(25):
    full_path = os.path.join(path, f'nl_part_{x}.txt')
    if os.path.isfile(full_path):
        paths.append(full_path)
        # out[f'oscar{i}'] = create_text_generator(load_dataset('text', data_files={"train": full_path}, split='train', streaming=True))

In [106]:
paths

[]

In [134]:
# preprocess function
def preprocess_lower(seq):
    return [s.strip(string.punctuation) for s in seq.strip().lower().split()]

# find paths
def get_all_file_paths(folder_path):
    file_paths = []
    for root, directories, files in os.walk(folder_path):
        # Filter out directories that start with a dot
        directories[:] = [d for d in directories if not d.startswith('.')]
        for file in files:
            # Filter out files that start with a dot
            if not file.startswith('.'):
                file_path = os.path.join(root, file)
                if os.path.isfile(file_path):  # Check if the path is a file
                    file_paths.append(file_path)
    return file_paths


paths = get_all_file_paths('/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short')

print(paths)


# make dict
def word_freqs_multiple_paths(paths):
    word_freqs = {}
    for path in paths:
        dataset = load_dataset('text', data_files={"train": path}, split='train')
        for i in dataset:
            for word in preprocess_lower(i['text']):
                if word in word_freqs:
                    word_freqs[word] += 1
                else:
                    word_freqs[word] = 1
    return word_freqs

word_freqs = word_freqs_multiple_paths(paths)


print('freqs dict is gemaakt')


# store
with open('frequencies20.json', 'w') as f:
    json.dump(word_freqs, f)

['/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short/OSCAR_short.txt']
freqs dict is gemaakt


In [136]:
with open('frequencies20.json', 'r') as f:
    a = json.load(f)

In [137]:
a

{'vul': 377,
 'het': 108245,
 'e-mailadres': 542,
 'in': 83947,
 'dat': 35537,
 'bij': 22024,
 'uw': 11450,
 'account': 606,
 'hoort': 420,
 'er': 19831,
 'zal': 4341,
 'een': 110514,
 'verificatiecode': 10,
 'naar': 16759,
 'worden': 14045,
 'verzonden': 252,
 'wanneer': 2625,
 'u': 23911,
 'de': 219750,
 'heeft': 12072,
 'ontvangen': 1031,
 'kunt': 6300,
 'nieuw': 1513,
 'wachtwoord': 355,
 'kiezen': 1121,
 'voor': 50721,
 'gebruikersnaam': 104,
 'dit': 15729,
 'wijkagent': 9,
 'michel': 52,
 'van': 129018,
 'kempen': 33,
 'micheal': 1,
 'is': 56843,
 'nijmegen': 221,
 'centrum': 937,
 'geworden': 632,
 'zijn': 34003,
 'vorige': 542,
 'wijken': 88,
 'voorlopig': 129,
 'onderverdeeld': 28,
 'koen': 40,
 'en': 126294,
 'yvonne': 25,
 'zodra': 456,
 'nieuwe': 5391,
 'zullen': 1714,
 'wij': 9959,
 'hier': 5854,
 'kenbaar': 58,
 'maken': 6738,
 'wensen': 521,
 'heel': 3699,
 'veel': 6445,
 'succes': 478,
 'plezier': 518,
 'wijk': 263,
 'om': 29232,
 'best': 1063,
 'mogelijke': 457,
 'webs

In [133]:
get_all_file_paths('/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR')


['/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_1.txt',
 '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_2.txt',
 '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short/OSCAR_short.txt']

In [131]:
paths = get_all_file_paths('/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR')


In [132]:
paths

['/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_1.txt',
 '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/nl_part_2.txt',
 '/Users/jan/Documents/Master/Thesis/Code/Datasets/OSCAR/Short/OSCAR_short.txt']

## Bewaren voor de zekerheid

In [None]:
import matplotlib.pyplot as plt

def count_morpheme_set(treshold):
    morpheme_set = set([])
    for word, freq in word_freqs.items():
        if freq >= treshold:
            for morpheme in segmentations_lowercase[word]:
                morpheme_set.add(morpheme)
    return len(morpheme_set)

results = {}
for i in range(30):
    results[i] = count_morpheme_set(i)

# plot
keys = list(results.keys())
values = list(results.values())

plt.figure(figsize=(10, 6))
plt.plot(keys, values, marker='o', linestyle='-', color='b')
plt.title('Morphemes vs Treshold')
plt.xlabel('Treshold')
plt.ylabel('Number of morphemes')
plt.grid(True)
plt.show()

In [None]:
# set vocabulary size
vocab_size = 20000

# select dataset to train with
train_set = bpe_generator(data, vocab, tokenizer)

# load an existing BPE tokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# retrain the tokenizer
tokenizer = old_tokenizer.train_new_from_iterator(train_set, vocab_size)

In [814]:
#t_robbert = RobertaTokenizerFast.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")

t_robbert = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2022-dutch-base")

In [815]:
t_robbert('Ik ga morgen lopen.')

{'input_ids': [0, 204, 544, 2149, 733, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [817]:
t_robbert('Ik ga morgen lopen. Daarna ga ik weer lopen. Gisteren heb ik een fiets gekocht. Dit is een input tekst.')

{'input_ids': [0, 204, 544, 2149, 733, 4, 1903, 544, 29, 87, 733, 4, 13932, 88, 29, 9, 1083, 2203, 4, 112, 12, 9, 12960, 1049, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [808]:
v = t_robbert.get_vocab()
g = {value: key for key, value in v.items()}


In [810]:
g[13932]

'ĠGisteren'

## Met ortho

In [None]:
dik = create_segmentation_dictionary(segmentation_data, word_family_data, word_freqs_20, extra_loop=True, add_morphemes=True, 
                                   add_empty=False, add_plurals=True, replace_non_identical=False, add_verbs=True, greedy_verb=False,
                                   add_nouns=True, greedy_noun=False, replace_verbs=True, replace_nouns=False, min_n_segments=1, 
                                   add_compounds=True, replace_compounds=False, remove_ortho=True, remove_not_in_corpus=False, meta_data=False, print_info=True):

In [83]:
base = create_initial_dataframe(segmentation_data)


# create initial segmentation dictionary
dik = create_segmentations_from_base(base)

In [85]:
dik2 = {}
for word, segs in dik.items():
    if len(segs) > 0 and ''.join(segs) != word:
        dik2[word] = segs

In [86]:
dik2

{'aaien': ['aai'],
 'aalbessengelei': ['aal', 'bes', 'en', 'gelei'],
 'aalbessenjam': ['aal', 'bes', 'en', 'jam'],
 'aalbessenjenever': ['aal', 'bes', 'en', 'jenever'],
 'aalbessesap': ['aal', 'bes', 'e', 'sap'],
 'aalbessestruik': ['aal', 'bes', 'e', 'struik'],
 'aalmoezenier': ['aalmoes', 'enier'],
 'aalsteker': ['aal', 'steek', 'er', 'NV'],
 'aambeeldsbeentje': ['aambeeld', 's', 'been'],
 'aanaarden': ['aan', 'aarde'],
 'aanbakken': ['aan', 'bak'],
 'aanbeeldsbeentje': ['aanbeeld', 's', 'been'],
 'aanbenen': ['aan', 'been'],
 'aanbehoren': ['aan', 'be', 'hoor'],
 'aanbellen': ['aan', 'bel'],
 'aanbelanden': ['aan', 'be', 'land'],
 'aanbelangen': ['aan', 'belang'],
 'aanbermen': ['aan', 'berm'],
 'aanbesteder': ['aan', 'besteed', 'er'],
 'aanbesteding': ['aan', 'besteed', 'ing'],
 'aanbesteden': ['aan', 'besteed'],
 'aanbesterven': ['aan', 'be', 'sterf'],
 'aanbeteren': ['aan', 'beter'],
 'aanbevelen': ['aan', 'beveel'],
 'aanbevelenswaard': ['aan', 'beveel', 's', 'waard'],
 'aanbeve

In [119]:
def split_verbs(base):

    verbs = {}
    rest = {}





    for word, dic in base.items():
        
        seg = dic['segments1']
        split1 = seg.split('+')
        split2 = dic['segments2']

        concat1 = ''.join(split1)
        concat2 = ''.join(split2)

        if concat2 != word and len(seg) > 0:


            if dic['cat'] == 'V':

                verbs[word] = split2
            
            else:

                rest[word] = split2


    return verbs, rest

In [120]:
v, n = split_verbs(base)

In [121]:
for word, segs in v.items():
    if segs[-1] == 'en':
        print(word)


In [122]:
n

{'aalbessengelei': ['aal', 'bes', 'en', 'gelei'],
 'aalbessenjam': ['aal', 'bes', 'en', 'jam'],
 'aalbessenjenever': ['aal', 'bes', 'en', 'jenever'],
 'aalbessesap': ['aal', 'bes', 'e', 'sap'],
 'aalbessestruik': ['aal', 'bes', 'e', 'struik'],
 'aalmoezenier': ['aalmoes', 'enier'],
 'aalmoezenierskamer': ['aalmoes', 'enier', 's', 'kamer'],
 'aalsteker': ['aal', 'steek', 'er', 'NV'],
 'aalvormig': ['aal', 'vorm', 'ig', 'NN'],
 'aambeeldsbeentje': ['aambeeld', 's', 'been'],
 'aamborstig': ['aam', 'borst', 'ig', 'NN'],
 'aamborstigheid': ['aam', 'borst', 'ig', 'NN', 'heid'],
 'aanaarding': ['aan', 'aarde', 'ing'],
 'aanaardploeg': ['aan', 'aarde', 'ploeg'],
 'aanbeeldsbeentje': ['aanbeeld', 's', 'been'],
 'aanbesteder': ['aan', 'besteed', 'er'],
 'aanbesteding': ['aan', 'besteed', 'ing'],
 'aanbetaling': ['aan', 'betaal', 'ing'],
 'aanbevelenswaard': ['aan', 'beveel', 's', 'waard'],
 'aanbevelenswaardig': ['aan', 'beveel', 's', 'waarde', 'ig'],
 'aanbeveling': ['aan', 'beveel', 'ing'],
 '

In [105]:
def add_en(df):

    out = {}
    same = {}

    for word, segs in df.items():
        # if word[-2:] == 'en':
        #     out[word] = segs + ['en']
        # else:
        #     same[word] = segs
        
        out[word] = segs + ['en']
    

    
    return out, same

In [103]:
v2, same = add_en(v)

In [106]:
rest = {}
for


{'a': ['a'],
 'Aafje': [''],
 'Aafke': [''],
 'Aagje': [''],
 'aagt': ['aagt'],
 'aagtappel': ['aagt', 'appel'],
 'aai': ['aai'],
 'aaiing': ['aai', 'ing'],
 'aak': ['aak'],
 'aal': ['aal'],
 'aaltje': [''],
 'aaltjes': [''],
 'aalbes': ['aal', 'bes'],
 'aalbessengelei': ['aalbes', 'en', 'gelei'],
 'aalbessenjam': ['aalbes', 'en', 'jam'],
 'aalbessenjenever': ['aalbes', 'en', 'jenever'],
 'aalbessesap': ['aalbes', 'e', 'sap'],
 'aalbessestruik': ['aalbes', 'e', 'struik'],
 'Aalders': [''],
 'aalelger': ['aal', 'elger'],
 'aalfuik': ['aal', 'fuik'],
 'aalgeer': [''],
 'aalglad': ['aal', 'glad'],
 'aalkaar': ['aal', 'kaar'],
 'aalkast': ['aal', 'kast'],
 'aalkorf': ['aal', 'korf'],
 'aalkuip': ['aal', 'kuip'],
 'aalkwab': ['aal', 'kwab'],
 'aalkwabbe': [''],
 'aalmoes': [''],
 'aalmoezenier': ['aalmoes', 'enier'],
 'aalmoezenierskamer': ['aalmoezenier', 's', 'kamer'],
 'aalpomp': ['aal', 'pomp'],
 'aalput': ['aal', 'put'],
 'aalreep': ['aal', 'reep'],
 'aalreiger': ['aal', 'reiger'],
 'a