# Prepare the Datasets
Before we start training a model and generating pseudo-words, we will need to prepare our datasets.
Hunspell dictionaries are a good place to start to collect a sizable list of lemmas, it can also be used at a later stage to ensure that the pseudo-words generated are not inflected versions of real words.

## Download a Hunspell Dictionary
Clicking on [this link](https://mozilla-l10n.github.io/firefox-dictionaries/complete.html) you will find a list of available and up-to-date dictionaries.
Find the dictionary you want to train your model on. Once on the page of the dictionary you want to download, instead of clicking on "add to Firefox", right-click and select "copy the link". Then past the value to assign it to the variable `dictionary_url`.
Think to also set the value of the variable `locale`, checkout the column "Dictionary Locale" of the table in the list of the dictionaries, as they might not be shaped the same way, for example, Welsh is "cy_GB" but British English is "en-GB".

In [1]:
%pip install requests spylls


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [42]:
import requests
import zipfile
import io
import os
import shutil

# Create a directory to store the dictionaries
os.makedirs("locales", exist_ok=True)

# Replace this link
# for Breton (be): https://addons.mozilla.org/firefox/downloads/file/4270474/difazier_an_drouizig-0.17resigned1.xpi
# for Welsh (cy_GB): https://addons.mozilla.org/firefox/downloads/file/4270302/geiriadur_cymraeg-1.8.3resigned1.xpi
# for English (en-GB): https://addons.mozilla.org/firefox/downloads/file/4270302/geiriadur_cymraeg-1.8.3resigned1.xpi
# for Dutch (nl): https://addons.mozilla.org/firefox/downloads/file/3776797/woordenboek_nederlands-4.20.19.xpi
dictionary_url = "https://addons.mozilla.org/firefox/downloads/file/4270474/difazier_an_drouizig-0.17resigned1.xpi"

# Replace with the appropriate ISO-369 code
locale = "br"

if os.path.isdir(f"locales/{locale}"):
    shutil.rmtree(f"locales/{locale}")

# Download and extract dictionary
response = requests.get(dictionary_url)
with zipfile.ZipFile(io.BytesIO(response.content)) as z:
    z.extractall("locales")
    print(f"{locale} dictionary decompressed")

# standardize name of files
for file in os.listdir(f"locales/"):
    if file.endswith(".dic"):
        os.rename(f"locales/dictionaries/{file}", f"locales/{locale}.dic")
    elif file.endswith(".aff"):
        os.rename(f"locales/dictionaries/{file}", f"locales/{locale}.aff")

try:
    files = os.listdir("locales")
    shutil.rmtree("locales/META-INF/")
    for file in files:
        file_path = os.path.join("locales", file)
        if os.path.isfile(file_path):
            os.remove(file_path)
    print("Folder cleaned successfully.")
except OSError:
    print("Error occurred while deleting files.")

os.rename("locales/dictionaries", f"locales/{locale}")
print("Check out your dictionary in", f"locales/{locale}")

br dictionary decompressed
Folder cleaned successfully.
Check out your dictionary in locales/br


You should have a none-empty `dictionaries` showing up in file browser in the left. Next, we will extract a list of lemmas from these dictionaries.

## Extract and Dump the List of lemmas
For various reasons, we don't want to end up with all the entries from the dictionary. Some may be proper names or contain numbers (e.g. 4x4) that don't allow them to qualify as lemmas. Here, for English lemmas, we used the following regex to filter words only containing lowercase letter: `re.compile('[a-z]+')`.
Depending on the language you want to treat, you may want to extend this regex, for example, Breton allows `c'h` as a common trigram, and the `'` character would need to be added, half of the Check alphabet would also be excluded from this regex (à, č, ź...) and I am not mentioning the letters from the Cyrillic or Greek alphabets... So think of adapting the regex in order not to miss common characters in your training dataset.

In [2]:
import os
import re

# Function to extract lemmas from a .dic file
def extract_lemmas(dic_path):
    lemmas = []
    lowercase = re.compile('[a-zŵŷë]+')
    with open(dic_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Each line in the .dic file contains a word followed by its affix flags
            # We only need the word part, which is before the '/' character
            word = line.split('/')[0]
            word2 = lowercase.findall(word)
            if len(word2) and word == word2[0]:
                lemmas.append(word)
    return lemmas

# Path to the dictionary files
dic_path = f"locales/{locale}/{locale}.dic"

# Extract lemmas
lemmas = extract_lemmas(dic_path)

# Print the first 10 lemmas as a sample
print(lemmas[:10])

['aagje', 'aagtappel', 'aagt', 'aaibaar', 'aaibaarheid', 'aaien', 'aaiing', 'aai', 'aai', 'aak']


Now, if we are happy with the items being displayed, we can carry on and dump them in a json file.

In [4]:
# Dump the lemmas to a json file
import json

# Define the output file path
output_file_path = f"locales/{locale}/lemmas.json"

# Write the lemmas list to the JSON file
with open(output_file_path, 'w', encoding='utf-8') as outfile:
    json.dump(lemmas, outfile, ensure_ascii=False, indent=4)

print(f"{len(lemmas)} items extracted and saved to {output_file_path}")

87407 items extracted and saved to dictionaries/nl/lemmas.json


If you see the number of items you successfully extracted, you're ready for the next part.
Please note that this is only one way to construct a list of words. Depending on the language, you may want to do web scrapping on an online dictionary. But every online dictionary has a different interface, and in that case you would have had to rewrite your own code. This is why we used a Hunspell dictionary, but it is definitely not a definitive nor universal way to do things.

# Bonus: loading data from a real dictionary
In this bonus section, we show how to load all the lemmas from the Breton dictionary [Devri](https://devri.bzh/). 


In [57]:
from itertools import chain

def clean(e):
    return e.split(" ")

l = ["a e", "i"]
print([q for c in l for q in clean(c) ])

['a', 'e', 'i']


In [2]:
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import json
import re

def clean_entry(entry):
    entry = entry.split("(")[0].replace("’","'")
    entries = re.split(r"[\s.\d/]+", entry)
    entry = entries[0]
    if (len(entry.replace("'", "").replace("-", "")) == 1 or entry.lower() != entry or entry[0] == "-" or entry[0] == "–" or entry[-1] == "-" or entry[-1] == ","):
        return None
    return [
        e.replace("–", "-").replace("­-", "-")
        for e in entries
        if len(e) > 1
        and not (
            len(e.replace("'", "").replace("-", "")) == 1
            or e.lower() != e
            or e[0] == "-"
            or e[0] == "–"
            or e[-1] == "-"
            or e[-1] == ","
            or "é" in e
            or "è" in e
            or "*" in e
        )
    ]

def scrape_dictionary_page(url):
    """
    Scrape dictionary entries from a given URL
    """    
    # Send HTTP request to the URL
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code != 200:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        return []
    
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find the dictionary entries (assuming they are within specific elements)
    # The selector may need adjustment based on the actual page structure
    entries = []
    
    # Look for dictionary entries - typically they might be in elements with class 'entry' or similar
    entries = [clean_entry(entry.text) for entry in soup.select('main .listelettre li a') if clean_entry(entry.text) != None]
    # Flatten the 2D array
    entries = [w for entry in entries for w in entry]
    
    return entries

def get_letter_entries(letter):
    print(f"Fetching the entries for {letter}")
    url = "https://devri.bzh/dictionnaire/" + letter
    last_page_reached = False
    page_number = 1
    all_entries_for_letter = []

    while not last_page_reached:
        new_entries = scrape_dictionary_page(f"{url}/page{page_number}")
        all_entries_for_letter += new_entries
        if len(new_entries) == 0:
            last_page_reached = True
        else:
            page_number += 1
    
    print(f"Found {len(all_entries_for_letter)} entries starting with '{letter}'")
    all_entries_for_letter = list(set(all_entries_for_letter))
    print(f"Found {len(all_entries_for_letter)} entries starting with '{letter}'")
    return all_entries_for_letter

def fetch_dictionary():
    url = "https://devri.bzh"
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code != 200:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        return []
    
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')
    letters = [letter.text for letter in soup.select('.rowmenulettres li a')]
    print("Letters in the dictionary:", letters)
    all_entries = []
    
    for letter in tqdm(letters[:]):
        all_entries += get_letter_entries(letter.replace("'", "-"))

    return list(set(all_entries))

entries = fetch_dictionary()
locale = "br"
entries.sort()

# Define the output file path
output_file_path = f"locales/{locale}/lemmas.json"

# Write the lemmas list to the JSON file
with open(output_file_path, 'w', encoding='utf-8') as outfile:
    json.dump(entries, outfile, ensure_ascii=False, indent=4)

print(f"{len(entries)} items extracted and saved to {output_file_path}")

Letters in the dictionary: ['a', 'b', 'ch', "c'h", 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z']


  0%|                              | 0/25 [00:00<?, ?it/s]

Fetching the entries for a


  4%|▉                     | 1/25 [00:21<08:29, 21.23s/it]

Found 5250 entries starting with 'a'
Found 4796 entries starting with 'a'
Fetching the entries for b


  8%|█▊                    | 2/25 [00:49<09:43, 25.38s/it]

Found 5850 entries starting with 'b'
Found 5137 entries starting with 'b'
Fetching the entries for ch


 12%|██▋                   | 3/25 [00:52<05:38, 15.38s/it]

Found 971 entries starting with 'ch'
Found 840 entries starting with 'ch'
Fetching the entries for c-h


 16%|███▌                  | 4/25 [00:54<03:27,  9.87s/it]

Found 383 entries starting with 'c-h'
Found 342 entries starting with 'c-h'
Fetching the entries for d


 20%|████▍                 | 5/25 [01:41<07:45, 23.26s/it]

Found 10402 entries starting with 'd'
Found 8801 entries starting with 'd'
Fetching the entries for e


 24%|█████▎                | 6/25 [01:55<06:24, 20.22s/it]

Found 2060 entries starting with 'e'
Found 1833 entries starting with 'e'
Fetching the entries for f


 28%|██████▏               | 7/25 [02:08<05:19, 17.74s/it]

Found 3018 entries starting with 'f'
Found 2602 entries starting with 'f'
Fetching the entries for g


 32%|███████               | 8/25 [02:29<05:18, 18.73s/it]

Found 4562 entries starting with 'g'
Found 4022 entries starting with 'g'
Fetching the entries for h


 36%|███████▉              | 9/25 [02:36<04:01, 15.07s/it]

Found 1864 entries starting with 'h'
Found 1633 entries starting with 'h'
Fetching the entries for i


 40%|████████▍            | 10/25 [02:41<03:00, 12.04s/it]

Found 882 entries starting with 'i'
Found 790 entries starting with 'i'
Fetching the entries for j


 44%|█████████▏           | 11/25 [02:44<02:10,  9.30s/it]

Found 582 entries starting with 'j'
Found 515 entries starting with 'j'
Fetching the entries for k


 48%|██████████           | 12/25 [03:25<04:07, 19.01s/it]

Found 8274 entries starting with 'k'
Found 7247 entries starting with 'k'
Fetching the entries for l


 52%|██████████▉          | 13/25 [03:41<03:37, 18.10s/it]

Found 3086 entries starting with 'l'
Found 2744 entries starting with 'l'
Fetching the entries for m


 56%|███████████▊         | 14/25 [04:01<03:24, 18.63s/it]

Found 3722 entries starting with 'm'
Found 3311 entries starting with 'm'
Fetching the entries for n


 60%|████████████▌        | 15/25 [04:05<02:20, 14.09s/it]

Found 785 entries starting with 'n'
Found 680 entries starting with 'n'
Fetching the entries for o


 64%|█████████████▍       | 16/25 [04:09<01:39, 11.01s/it]

Found 608 entries starting with 'o'
Found 542 entries starting with 'o'
Fetching the entries for p


 68%|██████████████▎      | 17/25 [04:41<02:18, 17.31s/it]

Found 5164 entries starting with 'p'
Found 4621 entries starting with 'p'
Fetching the entries for r


 72%|███████████████      | 18/25 [04:54<01:53, 16.20s/it]

Found 2675 entries starting with 'r'
Found 2268 entries starting with 'r'
Fetching the entries for s


 76%|███████████████▉     | 19/25 [05:19<01:52, 18.79s/it]

Found 5158 entries starting with 's'
Found 4664 entries starting with 's'
Fetching the entries for t


 80%|████████████████▊    | 20/25 [05:38<01:33, 18.73s/it]

Found 4542 entries starting with 't'
Found 4005 entries starting with 't'
Fetching the entries for u


 84%|█████████████████▋   | 21/25 [05:39<00:54, 13.60s/it]

Found 259 entries starting with 'u'
Found 234 entries starting with 'u'
Fetching the entries for v


 88%|██████████████████▍  | 22/25 [05:41<00:30, 10.14s/it]

Found 388 entries starting with 'v'
Found 356 entries starting with 'v'
Fetching the entries for w


 92%|███████████████████▎ | 23/25 [05:43<00:14,  7.48s/it]

Found 166 entries starting with 'w'
Found 153 entries starting with 'w'
Fetching the entries for y


 96%|████████████████████▏| 24/25 [05:46<00:06,  6.23s/it]

Found 270 entries starting with 'y'
Found 229 entries starting with 'y'
Fetching the entries for z


100%|█████████████████████| 25/25 [05:47<00:00, 13.89s/it]

Found 32 entries starting with 'z'
Found 30 entries starting with 'z'
62169 items extracted and saved to locales/br/lemmas.json



