The goal of this notebook is to create a custom dictionary, along with tools to quickly build one from any text. I'll use Shakespeare to start since I already have that text available.

In [2]:
# Open the file in read mode ('r')
with open('../data/shakespeare.txt', 'r', encoding='utf-8') as file:
    # Read the entire file into a string variable
    content = file.read()

# Split the content on any whitespace (spaces, tabs, newlines) to get words
words = content.split()

# Now the variable `words` is a list containing all the words
print(len(words))

unique_words = set(words)
print(len(unique_words))

963027
71927


Shakespeare has a big vocabulary! But splitting on whitespace as I did here also inflates that count and gives much worse results than actual tokenization. Many words show up in multiple versions that differ only in attached punctuation, capitalization, or hyphenation, and each version is treated as a separate word. I suspect this will lead to very poor performance when used with my actual model, but at this point I'm just concerned about getting a working MVP quickly, so whatever. This is very low-hanging fruit for improvement later though!
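
To make the problem concrete, here is a minimal sketch of a slightly smarter split, assuming it's fine to lowercase everything and keep only runs of letters (with internal apostrophes). The `simple_tokenize` helper and its regex are hypothetical and not used anywhere else in this notebook; a real tokenizer would handle far more cases.

```python
import re

# Hypothetical helper for illustration only; not part of the notebook's pipeline.
def simple_tokenize(text: str) -> list[str]:
    """Lowercase the text and return runs of letters, allowing internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

# Made-up sample sentence, not taken from the Shakespeare file.
sample = "Hark! Hark, the lark; the LARK."

# Whitespace splitting keeps punctuation and case, so variants count separately.
print(len(set(sample.split())))           # 5 distinct "words"
# Normalizing first collapses those variants.
print(len(set(simple_tokenize(sample))))  # 3 distinct words
```

Note that this sketch splits hyphenated words into their parts and drops leading apostrophes, which may or may not be what I want for Shakespeare.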

In [6]:
word_to_id = {word: i for i, word in enumerate(unique_words)}
id_to_word = {i: word for i, word in enumerate(unique_words)}

print(id_to_word[500])
print(word_to_id['mural'])

mural
500
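
One caveat with enumerating a `set`: its iteration order for strings isn't guaranteed to be the same across Python sessions (string hashing is randomized by default), so 'mural' may not land on ID 500 in a fresh kernel. A minimal sketch of a reproducible variant, assuming it's acceptable to sort the vocabulary first. The toy word list and the `toy_` names are made up for illustration.

```python
# Toy word list, made up for illustration (not taken from the Shakespeare file).
toy_words = ["the", "lark", "hark", "the", "lark"]

# Sorting the unique words fixes the order, so IDs are stable across runs.
vocab = sorted(set(toy_words))
toy_word_to_id = {word: i for i, word in enumerate(vocab)}
toy_id_to_word = dict(enumerate(vocab))

print(toy_word_to_id)     # {'hark': 0, 'lark': 1, 'the': 2}
print(toy_id_to_word[0])  # hark
```

Sorting costs a little extra time on a large vocabulary, but it means saved IDs stay valid if the dictionary is rebuilt later from the same text.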


In [7]:
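# putting it all together:
def build_dictionary(file_path: str) -> tuple[dict[str, int], dict[int, str]]:
    """Build word-to-ID and ID-to-word mappings from a whitespace-split text file."""
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    # Split on whitespace; punctuation and capitalization are kept as-is
    words = content.split()
    unique_words = set(words)
    # Both mappings enumerate the same unmodified set, so the IDs line up
    word_to_id = {word: i for i, word in enumerate(unique_words)}
    id_to_word = {i: word for i, word in enumerate(unique_words)}
    return word_to_id, id_to_word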

In [9]:
word_to_id, id_to_word = build_dictionary('../data/shakespeare.txt')
print(id_to_word[500])
print(word_to_id['mural'])
print(len(word_to_id))
print(len(id_to_word))

mural
500
71927
71927


Cool, we can now quickly build a (crappy) dictionary from any file. Again, there's lots of room to improve here with an actually intelligent tokenization approach, but for now the quick-and-dirty dictionary is good enough, and it's time to incorporate it into my model.
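
As a starting point for that, here is a minimal sketch of how the two mappings might be used to turn text into IDs and back. The `encode` and `decode` helpers are hypothetical, the toy mappings stand in for the output of `build_dictionary`, and I'm assuming that a word missing from the dictionary should raise an error rather than map to some unknown-word ID.

```python
# Toy mappings standing in for build_dictionary's output (made up for illustration).
toy_word_to_id = {"hark": 0, "lark": 1, "the": 2}
toy_id_to_word = {i: word for word, i in toy_word_to_id.items()}

# Hypothetical helper: whitespace-split the text, then look up each word's ID.
def encode(text: str, word_to_id: dict[str, int]) -> list[int]:
    return [word_to_id[word] for word in text.split()]  # KeyError on unseen words

# Hypothetical helper: map IDs back to words and join them with single spaces.
def decode(ids: list[int], id_to_word: dict[int, str]) -> str:
    return " ".join(id_to_word[i] for i in ids)

ids = encode("hark the lark", toy_word_to_id)
print(ids)                          # [0, 2, 1]
print(decode(ids, toy_id_to_word))  # hark the lark
```

Decoding joins words with single spaces, so the original line breaks and spacing are not recovered.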