# 2 Training the model
Before starting, we will load a list of lemmas from the previous part.

In [10]:
# Load the lemmas from a JSON file
import json
import random

# Change this variable to load another list of lemmas
locale = "br"

# Define the file path
file_path = f"locales/{locale}/lemmas.json"

# Read the lemmas list from the JSON file
try:
    with open(file_path, encoding="utf-8") as f:
        content = f.read()
        if not content.strip():
            raise ValueError("The JSON file is empty.")
        lemmas = json.loads(content)
except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
    lemmas = []
# json.JSONDecodeError is a subclass of ValueError, so it must be caught first
except json.JSONDecodeError:
    print(f"Error: Invalid JSON content in {file_path}")
    lemmas = []
except ValueError as e:
    print(f"Error: {e}")
    lemmas = []

print(f"{len(lemmas)} items loaded from {file_path}")

62183 items loaded from locales/br/lemmas.json


## 2.1 Data Preparation
Now we can start tokenizing our data. In the context of a character-level language model, tokenizing means turning the words that we humans can read into sequences of numbers that the model can interpret.

In [11]:
# ensure you have the necessary library
%pip install 'numpy<2' torch

Note: you may need to restart the kernel to use updated packages.


In [12]:
import torch
from torch.utils.data import Dataset, DataLoader

class CharDataset(Dataset):
    def __init__(self, sequences, vocab, separator_tag=None):
        self.sequences = sequences
        self.vocab = vocab
        self.char_to_idx = {char: idx for idx, char in enumerate(vocab)}
        self.idx_to_char = {idx: char for idx, char in enumerate(vocab)}
        # Always define the attribute (None when the words are not joined by a separator)
        self.sep_tag = separator_tag

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        sequence = self.sequences[idx]
        input_seq = [self.char_to_idx[char] for char in sequence[:-1]]
        target_seq = [self.char_to_idx[char] for char in sequence[1:]]
        return torch.tensor(input_seq), torch.tensor(target_seq)

# In this case "vocab" is simply the set of characters found in the lemmas (the alphabet)
vocab = sorted(set("".join(lemmas)))
dataset = CharDataset(lemmas, vocab)

This loads the lemmas into a dataset in a format that torch can understand. Each word is turned into a pair of sequences: an input (the word without its last character) and a target (the word without its first character), so the target at each position is the character that follows the input at that position. At this stage no special token is added, so the first letter of a word is never a target. Once each word is preceded by a "start of sequence" special token (the newline separator introduced in the next section plays that role), the target sequence becomes the full word. In plain English, this means that we also want our model to learn the most likely first letter of a word, not only the most likely next character given the beginning of the sequence.

All the characters are converted to numbers, each being the index of the input neuron that will be activated during training. The system has as many input neurons, or input dimensions, as there are items in the vocabulary (by vocabulary, we mean alphabet). This is a small number that allows the model to train on any computer, but imagine the size of a model whose vocabulary contains hundreds of thousands of words (from different languages), each one needing its own input neuron...
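
The shift between input and target can be sketched in plain Python, without torch. The helper `make_pair` and its `start_token` argument are hypothetical names introduced for illustration only; they are not part of `CharDataset`. The toy vocabulary is built from a single word plus a newline used as start token.

```python
def make_pair(word, char_to_idx, start_token=None):
    """Return (input_ids, target_ids) for next-character prediction."""
    # Optionally prepend a start-of-sequence token to the word
    sequence = word if start_token is None else start_token + word
    ids = [char_to_idx[char] for char in sequence]
    # The input drops the last character, the target drops the first one
    return ids[:-1], ids[1:]


# Toy vocabulary (assumption: one word and "\n" as the start token)
toy_vocab = sorted(set("\n" + "antierad"))
toy_char_to_idx = {char: idx for idx, char in enumerate(toy_vocab)}


def decode(ids):
    """Turn a list of indices back into a string."""
    return "".join(toy_vocab[i] for i in ids)


plain_input, plain_target = make_pair("antierad", toy_char_to_idx)
start_input, start_target = make_pair("antierad", toy_char_to_idx, start_token="\n")

print(repr(decode(plain_input)), "->", repr(decode(plain_target)))
print(repr(decode(start_input)), "->", repr(decode(start_target)))
```

Without the start token the target is `'ntierad'`, so the first letter is never predicted; with it, the target is the full word `'antierad'`.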

Run the following block to see how your data will be processed by the neural network.

In [13]:
from random import randrange
n = randrange(len(lemmas))

print(f"== {lemmas[n]} == \nbecomes the sequences:\n{dataset[n][0]} (input)\nand {dataset[n][1]} (target)")

== antierad == 
becomes the sequences:
tensor([ 3, 16, 21, 11,  7, 19,  3]) (input)
and tensor([16, 21, 11,  7, 19,  3,  6]) (target)


### 2.1.1 Grouping the sequences to learn
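For convenience during both training and generation, we shuffle the words and group them into 100 lists, each holding one percent of the total number of words, with a special newline character "\n" before and after every word. The first 95 lists are used for training and the remaining five for validation. Because the list size is rounded down, the few words left over after the 100th list are not used.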
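
The grouping can be illustrated on a toy list. This is a minimal sketch: `chunk_words` and `n_chunks` are hypothetical names, and three chunks stand in for the hundred used in the notebook.

```python
def chunk_words(words, n_chunks, sep="\n"):
    """Join words into n_chunks separator-delimited strings of equal size."""
    chunk_len = len(words) // n_chunks  # rounded down
    chunks = [
        sep + sep.join(words[i * chunk_len:(i + 1) * chunk_len]) + sep
        for i in range(n_chunks)
    ]
    # Words that do not fit into a full chunk are left out
    leftover = words[n_chunks * chunk_len:]
    return chunks, leftover


toy_words = ["zan", "zo", "zou", "zer", "zel", "zero", "zebu"]
chunks, leftover = chunk_words(toy_words, n_chunks=3)

print(chunks)
print("unused:", leftover)
```

Each chunk starts and ends with the separator, so the model sees "\n" both as a start-of-word and as an end-of-word marker.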

In [14]:
import random

random.shuffle(lemmas)
percent_len = len(lemmas)//100
sequences = ["\n" + "\n".join(lemmas[(n-1)*percent_len:n*percent_len])+ "\n" for n in range(1, 101)]
seq_training = sequences[:95]
seq_validating = sequences[95:]
vocab = sorted(set("".join(sequences)))
dataset = CharDataset(seq_training, vocab, "\n")
dataset_eval = CharDataset(seq_validating, vocab, "\n")
dataloader = DataLoader(dataset, shuffle=True)
dataloader_eval = DataLoader(dataset_eval, shuffle=True)

In [15]:
seq_validating

[w for w in ''.join(seq_training).split("\n") if len(w) and "z" == w[0]]

['zan',
 'zerasin',
 'zeerezh',
 'zingenn',
 'zeañ',
 'zoursilhañ',
 'zefoyan',
 'zo',
 'zeadur',
 'zeal',
 'zrodiñ',
 'zel',
 'zoken',
 'zou',
 'zero',
 'zoursilher',
 'zebu',
 'zrodad',
 'zamid',
 'zer',
 'zedacheg',
 'zouav',
 'zomuiken',
 'zink',
 "zokenoc'h",
 'zoursilh']

## 2.2 Defining the Model

In this part we design our network. We first initialize a PyTorch [module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module) by defining the different parts of the network: an embedding layer that turns each character into an `embedding_dim`-dimensional vector (an array of 64 numbers with the settings used below), an LSTM with `layers_number` layers (one here) that does the actual pattern recognition and prediction work, and a fully connected linear layer (`self.fc`) that converts the LSTM output into one score per character of the vocabulary. The index of the next character is then chosen from these scores.

The forward function defines the order in which the input data goes through the network. It outputs the prediction and the updated hidden state of the LSTM (the hidden and cell states are updated during the forward pass itself). Finally, `init_hidden` initializes these states with zero tensors of the right shape.
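
To make the data flow concrete, the following sketch pushes one toy sequence through the same three kinds of layers and prints the tensor shapes at each step. The sizes are arbitrary values chosen for illustration, not the ones used in this notebook.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration
vocab_size, embedding_dim, hidden_dim, layers_number = 10, 4, 8, 1

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, layers_number, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

# One sequence of five character indices: shape (batch=1, seq_len=5)
x = torch.tensor([[0, 3, 1, 4, 2]])

# Hidden state and cell state, both of shape (layers, batch, hidden_dim)
hidden = (torch.zeros(layers_number, 1, hidden_dim),
          torch.zeros(layers_number, 1, hidden_dim))

embedded = embedding(x)               # (1, 5, embedding_dim)
out, hidden = lstm(embedded, hidden)  # (1, 5, hidden_dim)
scores = fc(out)                      # (1, 5, vocab_size)

print(embedded.shape, out.shape, scores.shape, hidden[0].shape)
```

The last dimension of `scores` has one entry per character of the vocabulary, for every position of the input sequence.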

In [16]:
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, embedding_dim=4, hidden_dim=16, layers_number=1, char_to_idx=None, idx_to_char=None):
        super().__init__()
        # Avoid mutable default arguments
        self.char_to_idx = char_to_idx if char_to_idx is not None else {}
        self.idx_to_char = idx_to_char if idx_to_char is not None else {}
        vocab_size = len(self.char_to_idx)
        # Store the sizes so that init_hidden does not rely on global variables
        self.hidden_dim = hidden_dim
        self.layers_number = layers_number
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, layers_number, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    # The forward function is the one getting called every time
    # the model created by an instance of this class is called
    # model(x, hidden) == model.forward(x, hidden)
    def forward(self, x, hidden):
        x = self.embedding(x)
        out, hidden = self.lstm(x, hidden)
        out = self.fc(out)
        return out, hidden

    def init_hidden(self, batch_size=1):
        # One zero tensor for the hidden state and one for the cell state
        return (torch.zeros(self.layers_number, batch_size, self.hidden_dim),
                torch.zeros(self.layers_number, batch_size, self.hidden_dim))

# Example usage
embedding_dim = 64
hidden_dim = 256
layers_number = 1
char_to_idx = dataset.char_to_idx
idx_to_char = dataset.idx_to_char

model = LSTMModel(embedding_dim, hidden_dim, layers_number, char_to_idx, idx_to_char)

total_params = sum(p.numel() for p in model.parameters())
print(f'Model ready! Total number of parameters: {total_params}')

Model ready! Total number of parameters: 343210
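
The parameter count above can be reproduced by hand. The sketch below assumes a vocabulary of 42 characters; this value is not stated in the notebook and is inferred here because it is the only size that matches the reported total. `lstm_model_param_count` is a hypothetical helper.

```python
def lstm_model_param_count(vocab_size, embedding_dim, hidden_dim, layers_number=1):
    """Count the parameters of an embedding + LSTM + linear stack."""
    # Embedding: one vector of size embedding_dim per character
    embedding = vocab_size * embedding_dim

    lstm = 0
    for layer in range(layers_number):
        input_dim = embedding_dim if layer == 0 else hidden_dim
        # Four gates, each with input weights, recurrent weights and two bias vectors
        lstm += 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

    # Linear layer: weights plus one bias per character
    fc = hidden_dim * vocab_size + vocab_size
    return embedding + lstm + fc


# vocab_size=42 is an assumption inferred from the reported total
print(lstm_model_param_count(vocab_size=42, embedding_dim=64, hidden_dim=256))
```

Almost all of the parameters (329,728 of them) sit in the LSTM; the embedding and the linear layer only grow with the size of the alphabet.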


## 2.3 Training
After defining a couple of hyperparameters, we are ready to train our model.

In [20]:
import torch.optim as optim
from tqdm import tqdm

# Hyperparameters
num_epochs = 5
learning_rate = 0.005
vocab_size = len(char_to_idx)

# Loss function and optimizer
cross_entropy = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in tqdm(range(num_epochs)):
    # first, train the model
    model.train()
    hidden = model.init_hidden()
    training_loss = 0
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        outputs, hidden = model(inputs, hidden)
        loss = cross_entropy(outputs.view(-1, vocab_size), targets.squeeze(0))
        loss.backward()
        optimizer.step()
        # keep the loss of the last training batch for reporting
        training_loss = loss.item()
        # detach the hidden state so that gradients do not flow across batches
        hidden = (hidden[0].detach(), hidden[1].detach())

    # second, evaluate the model on held-out data to detect overfitting
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for inputs, targets in dataloader_eval:
            hidden = model.init_hidden()

            # forward pass
            outputs, hidden = model(inputs, hidden)
            loss = cross_entropy(outputs.view(-1, vocab_size), targets.squeeze(0))
            total_loss += loss.item()

    avg_loss_eval = total_loss / len(dataloader_eval)
    print(f'Epoch [{epoch+1}/{num_epochs}], Training Loss: {training_loss:.4f}, Validation Loss: {avg_loss_eval:.4f}')


 20%|████▌                  | 1/5 [00:20<01:22, 20.57s/it]

Epoch [1/5], Training Loss: 1.6473, Validation Loss: 1.7801


 40%|█████████▏             | 2/5 [00:40<01:00, 20.18s/it]

Epoch [2/5], Training Loss: 1.6552, Validation Loss: 1.7773


 60%|█████████████▊         | 3/5 [00:58<00:38, 19.27s/it]

Epoch [3/5], Training Loss: 1.6379, Validation Loss: 1.7761


 80%|██████████████████▍    | 4/5 [01:17<00:18, 18.93s/it]

Epoch [4/5], Training Loss: 1.6344, Validation Loss: 1.7797


100%|███████████████████████| 5/5 [01:38<00:00, 19.61s/it]

Epoch [5/5], Training Loss: 1.5924, Validation Loss: 1.7777
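
One way to read these numbers: `nn.CrossEntropyLoss` reports the loss in nats, so its exponential (the perplexity) gives the number of equally likely characters the model is effectively hesitating between. The sketch below is a hedged illustration; the helper names are hypothetical, and the vocabulary size of 42 is an assumption that is not stated in the notebook.

```python
import math


def perplexity(cross_entropy_nats):
    """Convert a cross-entropy loss in nats into a perplexity."""
    return math.exp(cross_entropy_nats)


def uniform_baseline(vocab_size):
    """Loss of a model that guesses every character with equal probability."""
    return math.log(vocab_size)


loss = 1.7777  # one of the losses reported for the last epoch above
print(f"perplexity: {perplexity(loss):.2f}")
# vocab_size=42 is an assumption, not a value taken from the notebook
print(f"uniform-guess loss for 42 characters: {uniform_baseline(42):.2f}")
```

A loss of about 1.78 corresponds to a perplexity of roughly 5.9, well below what uniform guessing over a few dozen characters would give.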





## 2.4 Sampling generated sequences

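In the following block, we can see how the model produces a score for every character of the vocabulary, and how temperature scaling and the softmax function turn these scores into the probabilities from which the next character is sampled.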

In [240]:
import torch.nn.functional as F

# First we disable the gradient calculation because we won't need it (no more backpropagation after the training)
# This makes the tensor representation cleaner
torch.set_grad_enabled(False)

hidden = model.init_hidden(1)
start_seq = [0, 3, 1]
inputs = torch.tensor(start_seq).unsqueeze(0)  # Shape: (1, seq_len)

outputs, hidden = model(inputs, hidden) # short for model.forward(inputs, hidden)

last_output = outputs[:, -1]
last_output[torch.where(last_output<0)] = 0
print("\nNext character weights vector (values below zero are set to zero):\n", last_output[0])


temperature = 0.01
last_output = last_output / temperature

print("\nWeights for the next character with temperature scaling:\n", last_output)

probabilities = F.softmax(last_output, dim=-1).squeeze(0)

print("\nProbabilities for the next character after scaling and with the softmax function:\n", probabilities)

# This is where the magic happens
# the multinomial method samples (in this case) one item following the weights of the probability vector it receives
predicted_idx = torch.multinomial(probabilities, 1).item()

print("Previous characters:", [dataset.idx_to_char[i] for i in start_seq])
print("Generated character:", dataset.idx_to_char[predicted_idx])

torch.set_grad_enabled(True)


Next character weights vector (values below zero are set to zero):
 tensor([0.0000, 0.0000, 0.0000, 0.0000, 2.1255, 0.0000, 0.0000, 0.9084, 2.0541,
        0.0000, 0.7476, 5.4077, 2.2497, 1.1581, 0.0000, 5.1731, 0.0000, 5.2352,
        0.0000, 0.0000, 4.8679, 0.0000, 3.6284, 0.3110, 0.0000, 2.0522, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000])

Weights for the next character with temperature scaling:
 tensor([[  0.0000,   0.0000,   0.0000,   0.0000, 212.5451,   0.0000,   0.0000,
          90.8359, 205.4075,   0.0000,  74.7593, 540.7704, 224.9718, 115.8097,
           0.0000, 517.3084,   0.0000, 523.5228,   0.0000,   0.0000, 486.7871,
           0.0000, 362.8356,  31.0972,   0.0000, 205.2190,   0.0000,   0.0000,
           0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
           0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,   0.0000,
           0.0

<torch.autograd.grad_mode.set_grad_enabled at 0x1632476b0>
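
The effect of the temperature used above can be sketched in plain Python, without torch. The scores below are toy values loosely inspired by the largest weights in the output; `softmax_with_temperature` is a hypothetical helper, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random


def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [s / temperature for s in scores]
    highest = max(scaled)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_scores = [5.4, 5.2, 4.9, 2.1]  # illustrative values only

for temperature in (1.0, 0.87, 0.01):
    probs = softmax_with_temperature(toy_scores, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Sample one index according to the probabilities
random.seed(0)
probs = softmax_with_temperature(toy_scores, 0.87)
sampled_idx = random.choices(range(len(probs)), weights=probs, k=1)[0]
print("sampled index:", sampled_idx)
```

At a temperature of 0.01 nearly all the probability mass falls on the highest score, so sampling becomes almost deterministic; values closer to 1 keep the alternatives alive and produce more varied words.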

Now we can generate some pseudo-words to "vibe check" how good our newly trained model is at the task it was created for, rather than relying only on cold cross-entropy results.

In [24]:
# import torch
import torch.nn.functional as F
from spylls.hunspell import Dictionary
import sys
print(len(lemmas))
dictionary = Dictionary.from_files(f"locales/{locale}/{locale}")


def generate_pseudoword(model, length=15, temperature=0.87):
    """Generate `length` distinct pseudo-words absent from both the dictionary and the lemmas."""
    model.eval()
    hidden = model.init_hidden(1)
    start_seq = [0]  # index 0 is expected to be the "\n" separator, which sorts first in vocab
    inputs = torch.tensor(start_seq).unsqueeze(0)  # Shape: (1, seq_len)
    generated_seq = []
    words_generated = set()
    known_words = set(lemmas)  # set membership is much faster than scanning the list

    with torch.no_grad():
        while len(words_generated) < length:
            outputs, hidden = model(inputs, hidden)

            # outputs shape: (1, seq_len, vocab_size)
            # We need the last time step's output for the next prediction
            last_output = outputs[:, -1]  # Shape: (1, vocab_size)

            # Apply temperature scaling
            last_output = last_output / temperature
            probs = F.softmax(last_output, dim=-1).squeeze(0)  # 1-D probability vector

            # Ensure all the probabilities are valid
            if torch.isnan(probs).any() or torch.isinf(probs).any() or (probs < 0).any():
                print("Invalid probabilities detected. Resetting to uniform distribution.")
                probs = torch.ones_like(probs) / probs.size(0)

            # Sample the next character
            predicted_idx = torch.multinomial(probs, 1).item()
            generated_seq.append(predicted_idx)
            inputs = torch.tensor([[predicted_idx]])  # Shape: (1, 1)

            # A separator marks the end of a word
            if vocab[predicted_idx] == "\n":
                new_word = ''.join([vocab[i] for i in generated_seq[:-1]])
                generated_seq = []
                # Keep only non-empty words that are neither real words nor known lemmas
                if new_word and not dictionary.lookup(new_word.capitalize()) and new_word not in known_words:
                    words_generated.add(new_word)
                sys.stdout.write(f"\r{len(words_generated)} words so far")

    return list(sorted(words_generated))

# Example usage
generated_pseudoword = generate_pseudoword(model, 1800)
print()
print("\n".join(generated_pseudoword[-20:]))


62183
1800 words so far
tuviedenn
uniwell
unsoc'h
urliñ
urmer
ursod
valouaj
vanvezad
viedad
viedreiñ
vious
vis-kent
visienn
vitinañs
voazhañ
war-staet
war-wesk
yaouaran
youc'hiñ
yourc'hennek


## 2.5 Saving and loading our results

If you are happy with the results, that is, with the loss (especially on the validation set) and with the generated words, you can run the following block to save the trained model.


In [21]:
# Save the best model you've trained so far
torch.save(model, f'locales/{locale}/lstm_model-{locale}.pth')

Or use this block to load a previously saved model to generate more non-words.

In [8]:
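# Load the last version of the model you saved to generate more words
# torch.save stored the whole model object, so recent PyTorch versions need weights_only=False to unpickle it
# (only do this with files you trust)
model = torch.load(f'locales/{locale}/lstm_model-{locale}.pth', weights_only=False)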
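
As an alternative to pickling the whole model object, PyTorch can also save only the parameters through `state_dict`. The sketch below is not what this notebook does: it uses a small `nn.Linear` as a stand-in for the trained model and a temporary file whose name is hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small stand-in module; in the notebook this would be the trained LSTMModel
stand_in = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    weights_path = os.path.join(tmp_dir, "weights.pth")  # hypothetical file name

    # Save only the parameters, not the pickled Python object
    torch.save(stand_in.state_dict(), weights_path)

    # Rebuild the architecture first, then load the parameters into it
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(weights_path))

# Check that every tensor was restored exactly
same_weights = all(
    torch.equal(a, b)
    for a, b in zip(stand_in.state_dict().values(), restored.state_dict().values())
)
print("weights restored:", same_weights)
```

With the notebook's model, the architecture would have to be rebuilt with the same `embedding_dim`, `hidden_dim`, `layers_number` and character mappings before calling `load_state_dict`.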

We can now generate our pseudo-lexicon. It is written to the pseudo-lemmas.json file in the folder of your source dictionary (`locales/<locale>/`), next to lemmas.json.

In [25]:
# Dump the generated pseudo-lemmas to a JSON file
import json
import time
start_time = time.time()

# Define the output file path
output_file_path = f"locales/{locale}/pseudo-lemmas.json"

generated_pseudoword = generate_pseudoword(model, len(lemmas))

# Write the pseudo-lemmas list to the JSON file
with open(output_file_path, 'w', encoding='utf-8') as outfile:
    json.dump(generated_pseudoword, outfile, ensure_ascii=False, indent=4)

print()
# Use a separate name so that the time module is not overwritten
elapsed = time.time() - start_time

print(f"{len(generated_pseudoword)} pseudo words successfully generated and loaded in {elapsed//60}:{(elapsed%60):.2f}s")

61487 words so far
61487 pseudo words successfully generated and loaded in 32.0:36.84s
