# Prepare the Datasets
Before we start training a model and generating pseudo-words, we will need to prepare our datasets.
Hunspell dictionaries are a good place to start to collect a sizable list of lemmas, it can also be used at a later stage to ensure that the pseudo-words generated are not inflected versions of real words.

## Download a Hunspell Dictionary
Clicking on [this link](https://mozilla-l10n.github.io/firefox-dictionaries/complete.html) you will find a list of available and up-to-date dictionaries.
Find the dictionary you want to train your model on. Once on the page of the dictionary you want to download, instead of clicking on "add to Firefox", right-click and select "copy the link".

In [1]:
!pip install requests spylls



In [8]:
import requests
import zipfile
import io
import os
import shutil

# Create a directory to store the dictionaries
os.makedirs("dictionaries", exist_ok=True)

# Replace this link
# for Breton (be): https://addons.mozilla.org/firefox/downloads/file/4270474/difazier_an_drouizig-0.17resigned1.xpi
# for Welsh (cy_GB): https://addons.mozilla.org/firefox/downloads/file/4270302/geiriadur_cymraeg-1.8.3resigned1.xpi
# for English (en-GB): https://addons.mozilla.org/firefox/downloads/file/4270302/geiriadur_cymraeg-1.8.3resigned1.xpi
dictionary_url = "https://addons.mozilla.org/firefox/downloads/file/4270302/geiriadur_cymraeg-1.8.3resigned1.xpi"
# Replace with the appropriate ISO-369 code
locale = "cy_GB"

if os.path.isdir(f"dictionaries/{locale}"):
    shutil.rmtree(f"dictionaries/{locale}")

# Download and extract dictionary
response = requests.get(dictionary_url)
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall("dictionaries")
    print(f"{locale} dictionary decompressed")

# standardize name of files
for file in os.listdir(f"dictionaries/"):
    if file.endswith(".dic"):
        os.rename(f"dictionaries/dictionaries/{file}", f"dictionaries/{locale}.dic")
    elif file.endswith(".aff"):
        os.rename(f"dictionaries/dictionaries/{file}", f"dictionaries/{locale}.aff")

try:
    files = os.listdir("dictionaries")
    shutil.rmtree("dictionaries/META-INF/")
    for file in files:
        file_path = os.path.join("dictionaries", file)
        if os.path.isfile(file_path):
            os.remove(file_path)
    print("Folder cleaned successfully.")
except OSError:
    print("Error occurred while deleting files.")

os.rename("dictionaries/dictionaries", f"dictionaries/{locale}")
print("Check out your dictionary in", f"dictionaries/{locale}")

cy_GB dictionary decompressed
Folder cleaned successfully.
Check out your dictionary in dictionaries/cy_GB


You should have a none-empty `dictionaries` showing up in file browser in the left. Next, we will extract a list of lemmas from these dictionaries.

## Extract and Dump the List of lemmas
For various reasons, we don't want to end up with all the entries from the dictionary. Some may be proper names or contain numbers (e.g. 4x4) that don't allow them to qualify as lemmas. Here, for English lemmas, we used the following regex to filter words only containing lowercase letter: `re.compile('[a-z]+')`.
Depending on the language you want to treat, you may want to extend this regex, for example, Breton allows `c'h` as a common trigram, and the `'` character would need to be added, half of the Check alphabet would also be excluded from this regex (à, č, ź...) and I am not mentioning the letters from the Cyrillic or Greek alphabets... So think of adapting the regex in order not to miss common characters in your training dataset.

In [45]:
from spylls.hunspell import Dictionary
import os
import re

# Function to extract lemmas from a .dic file
def extract_lemmas(dic_path):
    lemmas = []
    lowercase = re.compile('[a-zŵŷ]+')
    with open(dic_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Each line in the .dic file contains a word followed by its affix flags
            # We only need the word part, which is before the '/' character
            word = line.split('/')[0]
            word2 = lowercase.findall(word)
            if len(word2) and word == word2[0]:
                lemmas.append(word)
    return lemmas

# Path to the dictionary files
dic_path = f"dictionaries/{locale}/{locale}.dic"

# Extract lemmas
lemmas = extract_lemmas(dic_path)

# Print the first 10 lemmas as a sample
print(lemmas[:10])

['a', 'ab', 'abaci', 'abacws', 'abad', 'abadaeth', 'abades', 'aballu', 'abatai', 'abatir']


Now, if we are happy with the items being displayed, we can carry on and dump them in a json file.

In [44]:
# Dump the lemmas to a json file
import json

# Define the output file path
output_file_path = f"dictionaries/{locale}/lemmas.json"

# Write the lemmas list to the JSON file
with open(output_file_path, 'w', encoding='utf-8') as outfile:
    json.dump(lemmas, outfile, ensure_ascii=False, indent=4)

print(f"{len(lemmas)} items extracted and saved to {output_file_path}")

58953 items extracted and saved to dictionaries/cy_GB/lemmas.json


If you see the number of items you successfully extracted, you're ready for the next part.
Please note that this is only one way to construct a list of words. Depending on the language, you may want to do web scrapping on an online dictionary. But every online dictionary has a different interface, and in that case you would have had to rewrite your own code. This is why we used a Hunspell dictionary, but it is definitely not a definitive nor universal way to do things.