# Name Generator for Dinosaurs

### @Author : HADDOU Amine

The goal of this project is to develop different neural network models to generate new dinosaur names.<br>
The objective is to develop the following two models: 
- an n-gram language model
- a pre-trained model (from Hugging Face) fine-tuned for the project goal.

# Modules Import

In [1]:
import pandas as pd
import numpy as np

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split


import matplotlib.pyplot as plt
import os

# Discovering data

### Data Import

Importing data from `data/dinos.txt`

In [2]:
data = pd.read_csv("../data/dinos.txt", names=["dino_name"])

In [3]:
data.head()

Unnamed: 0,dino_name
0,Aachenosaurus
1,Aardonyx
2,Abdallahsaurus
3,Abelisaurus
4,Abrictosaurus


### Data Exploration

Here is an overview of the imported data.

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1536 entries, 0 to 1535
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   dino_name  1536 non-null   object
dtypes: object(1)
memory usage: 12.1+ KB


Exploring 20 random rows.

In [5]:
data.sample(20)

Unnamed: 0,dino_name
430,Dystylosaurus
1021,Panphagia
481,Eucercosaurus
1470,Xenoceratops
807,Lufengosaurus
950,Nurosaurus
1318,Tambatitanis
249,Camelotia
980,Oryctodromeus
151,Aurornis


Is there any punctuation (compound names, etc.)?

In [6]:
def print_non_alphabetical_chars(word):
    non_alphabetical = [char for char in word if char.lower() not in "abcdefghijklmnopqrstuvwxyz"]
    if non_alphabetical:
        print("Non-alphabetical characters in '{}': {}".format(word, ''.join(non_alphabetical)))
        return True
    return False

def print_non_alphabetical(data):
    presence = False # flag to check if there are any non-alphabetical characters
    for name in data["dino_name"]:
        if print_non_alphabetical_chars(name):
            presence = True
    if not presence:
        print("No non-alphabetical characters found in the dataset")

print_non_alphabetical(data)

No non-alphabetical characters found in the dataset


Does the data contain duplicate names? If so, we should remove them later during pre-processing.

In [7]:
print("Number of duplicates in the dataset: {}".format(data.duplicated().sum()))

Number of duplicates in the dataset: 12


Let's explore the word lengths.

In [8]:
# average length of dino names
avg_word_length = data["dino_name"].str.len().mean()

In [9]:
# max length of dino names
max_word_length = data["dino_name"].str.len().max()

In [10]:
# min length of dino names
min_word_length = data["dino_name"].str.len().min()

In [11]:
# quantile of dino names
quantile = data["dino_name"].str.len().quantile([0.25, 0.5, 0.75])

In [12]:
# number of unique dino names
unique_dino_names = data["dino_name"].nunique()

In [13]:
print(f"Average word length: {avg_word_length}")
print(f"Max word length: {max_word_length}")
print(f"Min word length: {min_word_length}")
print(f"Quantiles : {{25% : {quantile[0.25]}, 50% : {quantile[0.5]}, 75% : {quantile[0.75]}}}")

Average word length: 11.962239583333334
Max word length: 26
Min word length: 3
Quantiles : {25% : 10.0, 50% : 12.0, 75% : 13.0}


These results are interesting. Dinosaur names are, on average, quite long (about 12 characters). So, to generate a name, the model has to take into account enough context from the previous letters to generate the remaining part.

My first idea for the __n-gram model__ is to select `n_gram=3`, because I want to capture syllables in the names. In my opinion, the last syllable is enough context: it corresponds to the last 3 characters before the character we want to predict.<br>
In this context, a *syllable* is defined as a sequence of three characters in which, usually, at least the middle one is a vowel.<br>
Using trigrams (3-grams) also seems a reasonable choice since many names have lengths that are multiples of 3 (the median length is 12).

# Data Processing

## Lower Case Names

We start by lowercasing all letters.

In [14]:
data["dino_name"] = data["dino_name"].str.lower()
data.head(2)

Unnamed: 0,dino_name
0,aachenosaurus
1,aardonyx


## Delete Recurrent values

It is time to delete the previously identified duplicates.

In [15]:
data = data.drop_duplicates(subset='dino_name')
num_duplicates = data.duplicated().sum()
print(f"Number of duplicates in dino names: {num_duplicates}")

Number of duplicates in dino names: 0


## Padding

Let's add padding at the end of words to make name lengths uniform. We will try the `n-gram` model without padding.<br>
Maybe, with padding, the model will be able to learn __how__ dino names usually end ("aurus", "tor", etc.).

In [16]:
def add_padding(names: list[str], max_length: int) -> list[str]:
    padded_names = []
    for name in names :
        if len(name) < max_length:
            padded_names.append(name + "1" * (max_length - len(name)))
        else:
            padded_names.append(name[:max_length])
    return padded_names

In [17]:
data["paddded_dino_name"] = add_padding(data["dino_name"].values, max_word_length)

In [18]:
data.head(2)

Unnamed: 0,dino_name,paddded_dino_name
0,aachenosaurus,aachenosaurus1111111111111
1,aardonyx,aardonyx111111111111111111


## Start and End Token

Let's add `0` at the start and `1` at the end of names. The `0` will be useful at the beginning of the generation to generate a first letter. For a `k-gram` model, which uses `k-1` characters of context, we prepend `k-1` `0`s so that the first letter can be generated.

As for the `1`, it will help us determine when the generation is finished. It may also help the model *learn* how dino names end ("tor", "saurus", etc.).

In [19]:
def add_start_end_tokens(names: list[str], n: int) -> list[str]:
    """Adds start '0' and end '1' tokens to each name."""
    return ["0" * n + name + "1" for name in names]
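
As a quick illustration of the markers described above, the sketch below re-declares the helper so that it runs on its own and applies it to two toy names. The names `rex` and `ava` are made-up inputs, not entries from the dataset.

```python
def add_start_end_tokens(names: list[str], n: int) -> list[str]:
    """Same helper as above: n start markers '0', the name, then one end marker '1'."""
    return ["0" * n + name + "1" for name in names]

toy_names = ["rex", "ava"]  # hypothetical inputs, for illustration only
wrapped = add_start_end_tokens(toy_names, 3)
print(wrapped)  # ['000rex1', '000ava1']

# With three leading zeros, the very first context seen by a 4-gram model is "000".
first_contexts = [w[:3] for w in wrapped]
print(first_contexts)  # ['000', '000']
```

Every name therefore starts from the same context, which is what lets the generator pick a first letter without any real history.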

## Create Tokens

Here is a first function that generates a list of n-grams from a list of names at the **character** level.

When building the vocabulary and tokens, we add an `<UNK>` token that will be used for any unseen tokens or characters during generation. This is important because if the model produces a token that was not in the initial dataset, it won't be able to complete the generation.

In [20]:
def list_all_tokens(names: list[str], n: int) -> list[str]:
    """Generates and returns all unique n-grams from the names."""
    tokens = ["<UNK>"]  # Add the <UNK> token at the beginning
    for name in names:
        for i in range(len(name) - n + 1):
            token = name[i:i+n]
            if token not in tokens:
                tokens.append(token)
    return tokens

In [21]:
print(list_all_tokens(["dino", "dinosour", "dinosourus"], 4))

['<UNK>', 'dino', 'inos', 'noso', 'osou', 'sour', 'ouru', 'urus']


# 1) N-gram model

Here are some functions that are useful for this model __only__. Each model needs its own helper functions. Sometimes the difference is only a line or a word.

## Tools

After determining the tokens of the language, it is also necessary to determine our vocabulary. In our case, it is the set of all characters.

In [22]:
def build_vocabulary(names: list[str]) -> list[str]:
    """Builds and returns a list of all unique characters (vocabulary) from the names."""
    vocab = ["<UNK>"]  # Add the <UNK> token at the beginning
    for name in names:
        for letter in name:
            if letter not in vocab:
                vocab.append(letter)
    return vocab

Now, let's compute the probability of each letter appearing after a token.

In [23]:
def print_probabilities(probabilities: np.ndarray, tokens: list[str], vocab: list[str]) -> None:
    """Prints the probabilities of generating each letter for each token."""
    for i, token in enumerate(tokens):
        print(f"Token: {token}")
        for j, letter in enumerate(vocab):
            prob = probabilities[i][j]
            if prob > 0:
                print(f"    {letter}: {prob:.3f}")

In [24]:
def compute_probabilities(names: list[str], tokens: list[str], vocab: list[str], n: int) -> np.ndarray:
    """Computes the conditional probabilities of each token generating a letter from the vocabulary."""
    probabilities_array = np.zeros((len(tokens), len(vocab)))

    # Count occurrences of letters following each token
    for name in names:
        for i in range(len(name) - n):
            token = name[i:i+n]
            next_letter = name[i+n]  # The letter after the token
            if next_letter in vocab:
                token_index = tokens.index(token) if token in tokens else tokens.index("<UNK>")
                letter_index = vocab.index(next_letter)
                probabilities_array[token_index][letter_index] += 1
            else:
                print(f"Letter {next_letter} not in vocabulary")
                probabilities_array[tokens.index("<UNK>")][vocab.index("<UNK>")] += 1

    # Normalize the probabilities by dividing by row sums
    row_sums = probabilities_array.sum(axis=1, keepdims=True)
    probabilities_array = np.divide(probabilities_array, row_sums, where=row_sums != 0, out=probabilities_array)

    # print_probabilities(probabilities_array, tokens, vocab)
    
    return probabilities_array
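
The counting-then-normalising idea used above can also be sketched with plain dictionaries, which avoids the repeated `list.index` lookups. This is only an illustrative alternative, not the notebook's implementation: the helper names `count_next_letters` and `to_probabilities` and the toy names are assumptions made for the example.

```python
from collections import Counter, defaultdict

def count_next_letters(names: list[str], n: int) -> dict:
    """Count which character follows each context of length n."""
    counts = defaultdict(Counter)
    for name in names:
        for i in range(len(name) - n):
            counts[name[i:i + n]][name[i + n]] += 1
    return counts

def to_probabilities(counts: dict) -> dict:
    """Normalise the counts of each context so that they sum to 1."""
    probs = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        probs[context] = {ch: c / total for ch, c in counter.items()}
    return probs

toy = ["00abab1", "00abac1"]  # hypothetical names, already wrapped with '0' and '1'
probs = to_probabilities(count_next_letters(toy, 2))
print(probs["ab"])  # 'a' follows "ab" twice, '1' follows it once
print(probs["ba"])  # 'b' and 'c' each follow "ba" once
```

A dictionary keyed by context only stores the contexts that actually occur, whereas the dense array above has one row per token and one column per vocabulary entry.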

To add some randomness and avoid always generating the same name, we will not pick the most probable letter but one of the top k letters.

In [25]:
def choose_top_k_letters(probabilities: np.array, token_index: int, vocab: list[str], k: int = 5, verbose: bool = False) -> str:
    """Chooses one of the top k letters based on probabilities. Only selects letters with probability > 0."""
    
    # Sort the probabilities and get indices sorted by highest probability
    sorted_indices = np.argsort(probabilities[token_index])[::-1]  # Sort descending
    
    # Filter out indices where the probability is > 0
    valid_indices = [idx for idx in sorted_indices if probabilities[token_index][idx] > 0]

    # Adjust k if there are fewer than k valid options
    k = min(k, len(valid_indices))
    
    if k == 0:
        # Handle the case where no valid options with prob > 0 exist, fallback to <UNK> or any default behavior
        if verbose:
            print("No valid letters with probability > 0. Falling back to <UNK>.")
        return "<UNK>"  # or any other fallback behavior you'd like

    # Select the top k valid indices
    best_k_indices = valid_indices[:k]
    
    # Log top k predicted letters
    if verbose:
        print(f"\nTop {k} predicted letters for token index {token_index}:")
        for idx in best_k_indices:
            print(f"Letter: {vocab[idx]}, Probability: {probabilities[token_index][idx]}")

    # Choose one of the top k at random (or return the best if k=1)
    if k > 1:
        chosen_index = np.random.choice(best_k_indices)
    else:
        chosen_index = best_k_indices[-1]
    
    chosen_letter = vocab[chosen_index]
    if verbose:
        print(f"Chosen letter: {chosen_letter}")
    
    return chosen_letter
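
The function above picks uniformly among the top k letters, so the third most likely letter is drawn as often as the first. A possible variant, sketched below under that observation, keeps the top k restriction but weights the draw by the renormalised probabilities. The name `sample_top_k_weighted` and the toy probability row are assumptions for illustration.

```python
import numpy as np

def sample_top_k_weighted(row: np.ndarray, k: int, rng: np.random.Generator) -> int:
    """Pick an index among the k most probable entries, weighted by probability."""
    top = np.argsort(row)[::-1][:k]   # indices of the k largest probabilities
    top = top[row[top] > 0]           # drop entries that were never observed
    if top.size == 0:
        raise ValueError("row has no entry with probability > 0")
    weights = row[top] / row[top].sum()  # renormalise over the kept entries
    return int(rng.choice(top, p=weights))

rng = np.random.default_rng(0)
row = np.array([0.0, 0.6, 0.3, 0.1])  # hypothetical probabilities for one token
draws = [sample_top_k_weighted(row, k=2, rng=rng) for _ in range(1000)]
print(draws.count(1), draws.count(2))  # index 1 should be drawn about twice as often
```

Whether this gives nicer names than the uniform choice is an empirical question; it simply makes rare continuations rarer.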

## Model

Finally, we have all the tools to create the model.

In [26]:
def ngram_predict_new_name(tokens: list[str], vocab: list[str], probabilities: np.array, n: int, max_inter_count : int = 10, k: int = 5, verbose: bool = False) -> str:
    """Generates a new name using the n-gram model with <UNK> token handling."""
    generated_name = ""  # Start with an empty string
    token = "0" * n  # Initial token is the start token

    if verbose:
        print(f"\nStarting name generation:")
    
    inter_count = 0
    while inter_count < max_inter_count:
        # Handle unknown token by using <UNK> token
        token_index = tokens.index(token) if token in tokens else tokens.index("<UNK>")
        
        if verbose:
            print(f"\nCurrent token: {token}")
            print(f"Name under construction: {generated_name}")
        
        if token == "0" * n:  # Start token: choose any letter
            next_letter = choose_top_k_letters(probabilities, token_index, vocab, k=len(vocab)-1, verbose=verbose)
        else:
            # Find the next letter using one of the top k probabilities
            next_letter = choose_top_k_letters(probabilities, token_index, vocab, k=k, verbose=verbose)
        
        if next_letter == "1":  # End of name (using '1')
            if verbose:
                print(f"End of name reached with letter: 1\n")
            break
        
        generated_name += next_letter
        token = (token + next_letter)[-n:]  # Shift the token by one letter, keeping it at length n
        inter_count += 1
    
    if verbose:
        print(f"Final generated name: {generated_name}\n")
    return generated_name

In [27]:
def ngram_model(names: list[str], n: int = 4, num_predictions: int = 5, max_length_output : int = 10, k: int = 5, verbose: bool = False) -> list[str]:
    """Trains an n-gram model and generates new names."""
    # Add start/end tokens to names and build model components
    n = n - 1 # n-gram model uses n-1 letters to predict the new one. So I'll adjust n here. 
    names = add_start_end_tokens(names, n)
    tokens = list_all_tokens(names, n)
    vocab = build_vocabulary(names)
    probabilities = compute_probabilities(names, tokens, vocab, n)
    
    # Generate new names based on the model
    generated_names = []
    for _ in range(num_predictions):
        name = ngram_predict_new_name(tokens, vocab, probabilities, n, max_inter_count=max_length_output, k=k, verbose=verbose)
        generated_names.append(name)
    
    return generated_names

## Generate Names with n-gram

In [28]:
# Generate 5 names with n=2, top 3 letters considered, and verbose logging enabled
names = data["dino_name"].values
generated_names = ngram_model(names, n=2, max_length_output=max_word_length, num_predictions=5, k=3, verbose=True)

print(generated_names)


Starting name generation:

Current token: 0
Name under construction: 

Top 26 predicted letters for token index 1:
Letter: a, Probability: 0.1089238845144357
Letter: s, Probability: 0.09448818897637795
Letter: p, Probability: 0.07939632545931759
Letter: c, Probability: 0.07086614173228346
Letter: t, Probability: 0.06430446194225722
Letter: m, Probability: 0.05971128608923885
Letter: l, Probability: 0.05380577427821522
Letter: d, Probability: 0.0531496062992126
Letter: b, Probability: 0.04921259842519685
Letter: e, Probability: 0.04265091863517061
Letter: h, Probability: 0.04199475065616798
Letter: g, Probability: 0.03937007874015748
Letter: n, Probability: 0.031496062992125984
Letter: o, Probability: 0.02690288713910761
Letter: r, Probability: 0.026246719160104987
Letter: k, Probability: 0.025590551181102362
Letter: j, Probability: 0.01706036745406824
Letter: z, Probability: 0.01706036745406824
Letter: i, Probability: 0.015748031496062992
Letter: y, Probability: 0.015748031496062992

In [29]:
names = data["dino_name"].values
generated_names = ngram_model(names, n=3, max_length_output=max_word_length, num_predictions=5, k=2, verbose=False)

print(generated_names)

['rathusausaus', 'lanosuchinodrosucanodrathu', 'zhuale', 'yingon', 'xing']


In [30]:
# Generate 5 names with n=3, top 5 letters considered, and verbose logging disabled
names = data["dino_name"].values
generated_names = ngram_model(names, n=3, max_length_output=max_word_length, num_predictions=5, k=5, verbose=False)

print(generated_names)

['crang', 'mostelusanserotydolodzigar', 'ventos', 'ilodzistreocoekufengovecol', 'qansariocelinaxa']


In [31]:
# Generate 5 names with n=4, top 2 letters considered, and verbose logging disabled
names = data["dino_name"].values
generated_names = ngram_model(names, n=4, max_length_output=max_word_length, num_predictions=5, k=2, verbose=False)

print(generated_names)

['qinlonius', 'monosasparapelosauros', 'gong', 'barrhinuropeltongosaspleur', 'unenlagongongosucciniosucc']


In [32]:
# Generate 5 names with n=4, top 4 letters considered, and verbose logging disabled
names = data["dino_name"].values
generated_names = ngram_model(names, n=4, max_length_output=max_word_length, num_predictions=5, k=4, verbose=False)

print(generated_names)

['jingosuchunosuccinchunsaur', 'nomimucrocolamosphanoptery', 'hallasazisangsaurolopterot', 'zhongolosponodosaichthos', 'zhuchopsilonis']


In [33]:
# Generate 5 names with n=6, top 3 letters considered, and verbose logging disabled
names = data["dino_name"].values
generated_names = ngram_model(names, n=6, max_length_output=max_word_length, num_predictions=5, k=3, verbose=False)

print(generated_names)

['elrhazosauropteryx', 'nemegtomaia', 'marmarospondylosoma', 'sinraptorsaurushypacrosaur', 'wiehenvenatrix']


Results of my generator with different parameter values:
-  `n = 3` and `k = 2`<br>
['neoversosuccingshadros', 'bator', 'quilmayisaurutichodosuccin', 'dasylossus', 'inosphagros']
-  `n = 3` and `k = 3`<br>
['yungonius', 'elaplossuesiohadromaia', 'venescelusothostospinax', 'xingsauros', 'euskelyx']
-  `n = 2` and `k = 3`<br>
['heisaudasilis', 'kan', 'ale', 'xiasaustesaudiangobistrops', 'raptastriong']
- `n = 4` and `k=3`<br>
['unicerosaurophale', 'yurgovuchia', 'ischyrophus', 'ovirapterovenatosaurus', 'magnapartenykus']
- `n = 4` and `k=5`<br>
['cristatus', 'vouivria', 'yongjianosaurutitanius', 'mojoceratusauravusaurornit', 'epachthosuchomimoides']

It looks like, with a 4-gram model, we generate names that look like real dino names. Even with high randomness (k=5), the results are still good.

Even with the 3-gram model, the results are good (with low randomness).

## N-gram Model Summary

An **n-gram model** is a type of probabilistic model used for generating sequences (in this case, names) by predicting the next element in a sequence based on the previous **n-1** elements. The model learns the probability distribution of letters following specific contexts (substrings of length `n-1`, called tokens in the code above) from a training dataset.

This implementation generates new names one letter at a time using the learned probability distribution from the training data. The generated name is built step by step by predicting the next letter based on the preceding **n-1** letters.

---

### How the Model Works:
1. **Training Phase**:
   - The model processes the list of names to build the vocabulary and n-gram tokens.
   - It calculates the probability of each letter occurring after each token based on the training data.
   
2. **Generation Phase**:
   - The model generates new names by starting with a special start token (e.g., `"000"` for `n=4`).
   - For each step, it uses the previous n-1 letters to predict the next letter.
   - The next letter is chosen based on the learned probabilities, and the process continues until an end token (`"1"`) is generated or the name reaches the maximum length (`max_length_output`).
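
The generation phase described above can be sketched as a small loop over a lookup table. This is a simplified illustration, not the notebook's function: it samples in proportion to the probabilities instead of picking uniformly among the top k, and the `generate` helper, the `context_len` parameter and the toy table are assumptions.

```python
import random

def generate(probs: dict, context_len: int, max_len: int = 10, rng=None) -> str:
    """Slide a fixed-size context over the name being built, one letter at a time."""
    rng = rng or random.Random()
    context, name = "0" * context_len, ""
    while len(name) < max_len:
        options = probs.get(context)
        if not options:              # unseen context: stop instead of guessing
            break
        letters = list(options)
        weights = [options[letter] for letter in letters]
        next_letter = rng.choices(letters, weights=weights)[0]
        if next_letter == "1":       # end marker
            break
        name += next_letter
        context = (context + next_letter)[-context_len:]
    return name

# Hypothetical table for a 3-gram model (two characters of context).
toy_probs = {
    "00": {"r": 1.0},
    "0r": {"e": 1.0},
    "re": {"x": 1.0},
    "ex": {"1": 1.0},
}
print(generate(toy_probs, context_len=2))  # rex
```

Because every context in the toy table has a single continuation, the output is deterministic here; with real counts each step is a random draw.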

---

### Example Usage:

```python
# Example test data (dinosaur names)
dino_names = ["Tyrannosaurus", "Stegosaurus", "Triceratops"]

# Generate 5 names with n=4, maximum name length of 10, and top 3 letters considered
generated_names = ngram_model(dino_names, n=4, num_predictions=5, max_length_output=10, k=3, verbose=True)

print(generated_names)
```

This example generates 5 names with a 4-gram model (three letters of context), considers the top 3 letters at each step, and limits the names to 10 characters max.

# 2) LSTM model - Simple

An LSTM is a type of RNN usually used for sequence prediction tasks. It is capable of learning long-term dependencies in data, which is why it is used for text generation tasks.

I'll try to implement the model so that the prediction takes into account the position of the letter to predict __together with__ its context. I'll come back to this point later.

## Tools

As before, we first have to define the vocabulary of our language. It differs from the previous one because here I decided to define each character as an element of the vocabulary.

In [34]:
def build_vocabulary_classique_LSTM(names : list[str]) :
    chars = sorted(list(set(''.join(names))))
    vocab = list(dict.fromkeys(chars)) # make sure there are no duplicates
    char_to_index = {char: idx for idx, char in enumerate(vocab)}
    index_to_char = {idx: char for idx, char in enumerate(vocab)}
    vocab_size = len(vocab)
    return vocab, char_to_index, index_to_char, vocab_size

In [35]:
def add_padding_to_seq(seq: list[int], max_length: int) -> list[int]:
    return [0] * (max_length - len(seq)) + seq

After this definition, we can encode the data. Otherwise, the model won't be able to learn from dino names.

As we can see in the following code, `seq_in` is a sequence that grows by one letter of `name` at each iteration of `i`. When `seq_in` is too short, a padding of `0` is added at the beginning of the sequence. Here, `0` is also the start marker, so this may indirectly help the model take the position of the letter into account.

In [36]:
def encoding(names: list[str], char_to_index: dict, sequence_length: int = 15) -> tuple[np.array, np.array]: 
    data_X, data_y = [], []
    for name in names:
        encoded_name = [char_to_index[char] for char in name]
        for i in range(1, len(encoded_name)):
            seq_in = encoded_name[:i]
            seq_out = encoded_name[i]
            seq_in = pad_sequences([seq_in], maxlen=sequence_length, padding='pre')[0]
            data_X.append(seq_in)
            data_y.append(seq_out)
    return np.array(data_X), np.array(data_y)
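
To make the growing-prefix encoding concrete, here is a pure-Python sketch that mimics the left padding applied by `pad_sequences(..., padding='pre')` on a single toy name. The helpers `left_pad` and `prefix_pairs` and the five-character mapping are assumptions for illustration; the real mapping comes from `build_vocabulary_classique_LSTM`.

```python
def left_pad(seq: list[int], length: int, pad_value: int = 0) -> list[int]:
    """Left-pad (or keep the last `length` items of) a list of ints."""
    seq = seq[-length:]
    return [pad_value] * (length - len(seq)) + seq

def prefix_pairs(name: str, char_to_index: dict, length: int) -> list:
    """Build (padded prefix, next index) pairs, one per position in the name."""
    encoded = [char_to_index[c] for c in name]
    return [(left_pad(encoded[:i], length), encoded[i]) for i in range(1, len(encoded))]

toy_mapping = {"0": 0, "1": 1, "e": 2, "r": 3, "x": 4}  # hypothetical mapping
pairs = prefix_pairs("0rex1", toy_mapping, length=4)
for x, y in pairs:
    print(x, "->", y)
# [0, 0, 0, 0] -> 3
# [0, 0, 0, 3] -> 2
# [0, 0, 3, 2] -> 4
# [0, 3, 2, 4] -> 1
```

Note that the padding value and the start marker share index 0, so the number of leading zeros tells the model how far into the name it is.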

This function uses the previous ones to make the last adjustments to the dino names (final processing) and then returns the training data, the vocabulary and other information needed for generation.

In [37]:
def prepare_training_data(data: pd.DataFrame, max_length: int) -> tuple[np.array, np.array, list[str], dict, dict, int]:
    names = data["dino_name"].values
    names = add_start_end_tokens(names, 1)
    vocab, char_to_index, index_to_char, vocab_size = build_vocabulary_classique_LSTM(names)
    # Use the max_length argument rather than the global max_word_length
    names = add_padding(names, max_length=max_length)
    data_X, data_y = encoding(names, char_to_index, sequence_length=max_length)
    return data_X, data_y, vocab, char_to_index, index_to_char, vocab_size

## Model

The model we are using to generate dinosaur names is based on an LSTM neural network. Here is the configuration of the layers:

- **Input Layer**: Accepts input sequences of length `sequence_length`.
- **Embedding Layer**: Converts each character (represented by an integer index) into a 13-dimensional vector, helping the model learn relationships between characters. The embedding has `input_dim=vocab_size` and `output_dim=13`; the sequence length is fixed by the input layer.
- **LSTM Layer**: Contains 128 units, allowing the model to learn the temporal dependencies between characters in the sequence, crucial for generating plausible names.
- **Dense Layer**: Outputs a probability distribution over the `vocab_size` possible characters using the `softmax` activation function. This allows the model to predict the next character in the sequence.
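
To get a feel for the size of this architecture, the sketch below counts trainable parameters per layer. It assumes the standard Keras LSTM parameterisation (four gates, one bias vector each) and a vocabulary of 28 symbols (26 lowercase letters plus the `0` and `1` markers); both are assumptions, and `model.summary()` remains the authoritative source.

```python
def count_parameters(vocab_size: int, embedding_size: int = 13, hidden_units: int = 128) -> dict:
    """Trainable parameters per layer for Embedding -> LSTM -> Dense(softmax)."""
    embedding = vocab_size * embedding_size
    # Four gates, each with input weights, recurrent weights and one bias vector.
    lstm = 4 * (hidden_units * (embedding_size + hidden_units) + hidden_units)
    dense = hidden_units * vocab_size + vocab_size
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}

# Assumption: 26 lowercase letters plus the '0' and '1' markers.
params = count_parameters(vocab_size=28)
print(params)
# {'embedding': 364, 'lstm': 72704, 'dense': 3612, 'total': 76680}
```

Almost all of the capacity sits in the LSTM layer, so `hidden_units` is the main knob for model size.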

In [38]:
def create_LSTM_model(vocab_size: int, sequence_length: int, embedding_size: int = 13, hidden_units: int = 128, show_summary: bool = False) -> Model:
    inputs = Input(shape=(sequence_length,))
    embedding = Embedding(input_dim=vocab_size, output_dim=embedding_size)(inputs)
    lstm = LSTM(hidden_units)(embedding)
    outputs = Dense(vocab_size, activation='softmax')(lstm)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    if show_summary:
        model.summary()
    return model

A function to train the model. It will be reused later for other models.

In [39]:
def training(model: Model, data_X: np.array, data_y: np.array, epochs: int = 100, batch_size: int = 32) -> Model:
    # Train the model
    mycallback = EarlyStopping(monitor='loss', patience=5)
    model.fit(data_X, data_y, epochs=epochs, batch_size=batch_size, callbacks=[mycallback])
    return model

In [40]:
def plot_training_history(history):
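    # Expects the History object returned by model.fit(), with 'accuracy' and
    # 'val_accuracy' entries (i.e. metrics=['accuracy'] and validation data).
    plt.figure(figsize=(12, 6))
    
    # Plot training & validation accuracy values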
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Validation'], loc='upper left')
    plt.show()

## Training

First, let's run the preprocessing on the data.

In [41]:
data_X, data_y, vocab, char_to_index, index_to_char, vocab_size = prepare_training_data(data, max_word_length)

Model initialisation and training.

In [42]:
lstm_model = create_LSTM_model(vocab_size, max_word_length)
lstm_model = training(lstm_model, data_X, data_y, epochs=50, batch_size=32)

Epoch 1/100
[1m1191/1191[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 16ms/step - loss: 1.5065
Epoch 2/100
[1m1191/1191[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 17ms/step - loss: 1.0493
Epoch 3/100
[1m1191/1191[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 17ms/step - loss: 0.9786
Epoch 4/100
[1m1191/1191[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 18ms/step - loss: 0.9283
Epoch 5/100
[1m1191/1191[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 16ms/step - loss: 0.8851
Epoch 6/100
[1m1191/1191[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 16ms/step - loss: 0.8532
Epoch 7/100
[1m1191/1191[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 16ms/step - loss: 0.8238
Epoch 8/100
[1m1191/1191[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m20s[0m 16ms/step - loss: 0.7845
Epoch 9/100
[1m1191/1191[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 16ms/step - loss: 0.7703
Epoch 10/100
[1m1191/1191[0m [32m━━━━━━━━━━

KeyboardInterrupt: 

## Generation of name

Now we have a trained LSTM. We can implement new functions to generate new dino names with the LSTM generator.

### Function to generate a new name

The goal of the following function is to introduce some randomness into the generator: instead of always predicting the most probable letter, it picks one of the k most probable letters.

In [39]:
def choose_top_k_letters_from_model(predictions: np.array, k: int = 5) -> int:
    """Chooses one of the top k letters based on probabilities. Only selects letters with probability > 0."""
    sorted_indices = np.argsort(predictions)[::-1]  # Sort descending
    # Filter out indices where the probability is > 0
    valid_indices = [idx for idx in sorted_indices if predictions[idx] > 0]
    # Adjust k if there are fewer than k valid options
    k = min(k, len(valid_indices))
    if k == 0:
        print("No valid letters with probability > 0.")
        # Return random letter if no valid letter found (except 0)
        letter = 0
        while letter == 0:
            letter = np.random.choice(len(predictions))
        return letter

    # Select the top k valid indices
    best_k_indices = valid_indices[:k]
    
    # Choose one of the top k at random (or return the best if k=1)
    if k > 1:
        chosen_index = np.random.choice(best_k_indices)
    else:
        chosen_index = best_k_indices[-1]
    
    return chosen_index

Here is a function that generates a single new name.

In [40]:
def generate_name(model: "Model", char_to_index: dict, index_to_char: dict, sequence_length: int, max_length: int = 20, k: int = 5) -> str:
    name = [0]  # start token ('0')
    while len(name) < max_length:
        seq_in = add_padding_to_seq(name, sequence_length)
        prediction = model.predict(np.array([seq_in]), verbose=0)
        next_index = choose_top_k_letters_from_model(prediction[0], k)
        if next_index == 1:  # end token ('1')
            break
        name.append(next_index)
    return ''.join([index_to_char[idx] for idx in name[1:]])

A function to generate multiple names at once.

In [41]:
def generate_n_names(model: "Model", char_to_index: dict, index_to_char: dict, sequence_length: int, n: int, k: int = 5) -> list[str]:
    names = []
    for _ in range(n):
        name = generate_name(model, char_to_index=char_to_index, index_to_char=index_to_char, sequence_length=sequence_length, k=k)
        names.append(name)
    return names

### Generations

`k` (top-k size, used here like a temperature) = 3

In [42]:
names = generate_n_names(lstm_model, char_to_index=char_to_index, index_to_char=index_to_char, sequence_length=max_word_length, n=5, k=3)
print(names)

['siarontentongadisus', 'sauiclaloscoidolove', 'albisprostyiamisabh', 'panplicomumiora', 'pangongonykur']


`k`= 2

In [43]:
names = generate_n_names(lstm_model, char_to_index=char_to_index, index_to_char=index_to_char, sequence_length=max_word_length, n=5, k=2)
print(names)

['siamodrator', 'saltrisomimoigery', 'antaltacidaxauropst', 'sanjunx', 'saltrisapra']


`k`= 6

In [44]:
names = generate_n_names(lstm_model, char_to_index=char_to_index, index_to_char=index_to_char, sequence_length=max_word_length, n=5, k=6)
print(names)

['saldsplrrug', 'ariyopithnunikr', 'pideiragongevantoli', 'anopslumemovenaturu', 'crupzognig']


Example of an output for `k=6`: `['conguionniodavithya', 'chesallientapatotho', 'priantaleor', 'stocusudhaleus', 'ausarodynnimiegitol']`
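
The `k` used in these generations is a top-k cut-off rather than a temperature in the usual sense. For comparison, the sketch below shows what temperature scaling of a softmax output would look like; it is not what the notebook does, and `apply_temperature` and the toy distribution are assumptions for illustration.

```python
import numpy as np

def apply_temperature(probs: np.ndarray, temperature: float) -> np.ndarray:
    """Rescale a probability vector: T < 1 sharpens it, T > 1 flattens it."""
    logits = np.log(np.clip(probs, 1e-12, None)) / temperature
    logits -= logits.max()           # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

probs = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output
sharp = apply_temperature(probs, 0.5)
flat = apply_temperature(probs, 2.0)
print(sharp.round(3), flat.round(3))
```

A low temperature behaves a bit like a small `k` (safer, more repetitive names), while a high temperature behaves like a large `k` (more varied, less plausible names).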

# 3) *N-gram* based LSTM Models

Results of the previous model were acceptable, sometimes even good. But let's try another approach.

The vocabulary will contain the usual letters and, in addition, the *n-gram tokens*. Tokens may help to identify patterns and improve the prediction.

## Tools

As for the previous models, this subsection defines useful functions to train the new model and generate dino names with it.

Here, we make the vocabulary change mentioned above.

In [21]:
def list_all_tokens(names: list[str], n: int) -> list[str]:
    tokens = ["<UNK>", "0", "1"]  # Add the <UNK> token, the start token '0' and the end/padding token '1' at the beginning

    def get_letters_in_name(names: list[str]) -> list[str]:
        letters = []
        for name in names :
            for l in name :
                if l not in letters:
                    letters.append(l)
        return letters
    
    tokens += get_letters_in_name(names)
    for name in names:
        for i in range(len(name) - n + 1):
            token = name[i:i+n]
            if token not in tokens:
                tokens.append(token)
    return tokens

In [22]:
# Builds vocabulary and mappings for LSTM with specified n-gram tokens.
def build_n_vocabulary_for_lstm(names: list[str], n: int = 1):
    names = add_start_end_tokens(names, n)
    tokens = list_all_tokens(names, n)
    char_to_index = {char: idx for idx, char in enumerate(tokens)}
    index_to_char = {idx: char for idx, char in enumerate(tokens)}
    vocab_size = len(tokens)
    return tokens, char_to_index, index_to_char, vocab_size

In [23]:
# Adds padding to make sequences of equal length (note: rjust pads on the left with '1').
def add_padding(names: list[str], max_length: int) -> list[str]:
    return [name.rjust(max_length, '1') for name in names]

In [24]:
def encoding(names: list[str], char_to_index: dict, sequence_length: int, ngram: int):
    data_X, data_y = [], []
    for name in names:
        encoded_name = [char_to_index[char] for char in name]
        for i in range(0, len(encoded_name) - ngram):
            data_X.append(encoded_name[i:i+ngram])
            data_y.append(encoded_name[i+ngram])  # Use only the next single token
    data_X = np.array(data_X)
    data_y = np.array(data_y)
    print(f"data_X shape: {data_X.shape}, data_y shape: {data_y.shape}")
    return data_X, data_y
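
Unlike the growing-prefix encoding of the simple LSTM, this version slides a fixed-size window over each name. The sketch below shows the resulting pairs on characters instead of indices so that they are easy to read; `window_pairs` and the toy name are assumptions for illustration.

```python
def window_pairs(name: str, window: int) -> list[tuple[str, str]]:
    """Return (context, next character) pairs for a fixed-size sliding window."""
    return [(name[i:i + window], name[i + window]) for i in range(len(name) - window)]

pairs = window_pairs("00rex1", window=2)  # hypothetical name with its markers
print(pairs)
# [('00', 'r'), ('0r', 'e'), ('re', 'x'), ('ex', '1')]
```

Every training example therefore has exactly `ngram` inputs, which is why this model's input layer needs a sequence length of `ngram` rather than the maximum name length.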

And a last function that calls the previous ones.

In [25]:
# Prepares training data for the LSTM.
def prepare_n_training_data(names: list[str], ngram=4, max_length: int = 15):
    names = add_start_end_tokens(names, n=ngram)
    names = add_padding(names, max_length=max_length)
    vocab, char_to_index, index_to_char, vocab_size = build_n_vocabulary_for_lstm(names, ngram)
    data_X, data_y = encoding(names, char_to_index, sequence_length=max_length, ngram=ngram)
    return data_X, data_y, vocab, char_to_index, index_to_char, vocab_size

## Model(s)

In [26]:
# Creates an LSTM model.
def create_LSTM_model(vocab_size: int, sequence_length: int, embedding_size: int = 13, hidden_units: int = 128, dropout_rate: float = 0.2, show_summary: bool = False) -> Model:
    inputs = Input(shape=(sequence_length,))
    embedding = Embedding(input_dim=vocab_size, output_dim=embedding_size)(inputs)
    lstm = LSTM(hidden_units, return_sequences=False)(embedding)
    dropout = Dropout(dropout_rate)(lstm)
    outputs = Dense(vocab_size, activation='softmax')(dropout)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    if show_summary:
        model.summary()
    return model

In [27]:
# Trains the LSTM model.
def training(model: Model, data_X: np.ndarray, data_y: np.ndarray, epochs: int = 100, batch_size: int = 32) -> Model:
    data_X_train, data_X_val, data_y_train, data_y_val = train_test_split(data_X, data_y, test_size=0.2, random_state=42)
    mycallback = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    model.fit(data_X_train, data_y_train, validation_data=(data_X_val, data_y_val), epochs=epochs, batch_size=batch_size, callbacks=[mycallback])
    return model
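
The `EarlyStopping` callback in `training` stops once `val_loss` has failed to improve for `patience` consecutive epochs and, with `restore_best_weights=True`, rolls the model back to its best epoch. The sketch below is a simplified, pure-Python imitation of that rule (it ignores options such as `min_delta`); `simulate_early_stopping` and the loss values are invented for illustration and are not Keras code or results from this notebook.

```python
def simulate_early_stopping(val_losses: list[float], patience: int = 5) -> tuple[int, int]:
    """Return (stopped_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without triggering the stop condition.
    return len(val_losses) - 1, best_epoch

toy_losses = [1.48, 1.36, 1.28, 1.30, 1.29, 1.31]  # invented values
print(simulate_early_stopping(toy_losses, patience=2))  # (4, 2)
```

Here training would stop after the fifth epoch (index 4) and the weights of the third epoch (index 2) would be kept.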

## Functions to generate new names

In [28]:
# Picks uniformly at random among the k most probable tokens (zero-probability tokens are excluded).
def choose_top_k_letters_from_model(predictions: np.ndarray, k: int = 5) -> int:
    sorted_indices = np.argsort(predictions)[::-1]
    valid_indices = [idx for idx in sorted_indices if predictions[idx] > 0]
    k = min(k, len(valid_indices))
    if k == 0:
        return np.random.randint(1, len(predictions))
    best_k_indices = valid_indices[:k]
    return np.random.choice(best_k_indices) if k > 1 else best_k_indices[0]
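
`choose_top_k_letters_from_model` draws uniformly among the `k` most likely tokens, so the k-th candidate is as likely to be picked as the best one. A common variant samples in proportion to the predicted probabilities instead. The sketch below illustrates that variant using only the standard library; `sample_top_k_weighted` is a hypothetical helper and not the function used in this notebook.

```python
import random
from typing import Optional

def sample_top_k_weighted(predictions: list[float], k: int = 5,
                          rng: Optional[random.Random] = None) -> int:
    """Sample an index among the k largest probabilities, weighted by probability."""
    rng = rng or random.Random()
    # Indices with a positive probability, sorted from most to least likely.
    ranked = sorted((i for i, p in enumerate(predictions) if p > 0),
                    key=lambda i: predictions[i], reverse=True)
    top = ranked[:k]
    if not top:
        raise ValueError("no token has a positive probability")
    weights = [predictions[i] for i in top]
    # random.choices treats the weights as relative, so no renormalization is needed.
    return rng.choices(top, weights=weights, k=1)[0]

toy_probs = [0.05, 0.6, 0.0, 0.3, 0.05]  # invented distribution
picked = sample_top_k_weighted(toy_probs, k=2, rng=random.Random(0))
print(picked)  # always 1 or 3, the two most likely indices
```

Weighted sampling tends to stay closer to the training distribution, while the uniform choice used above trades plausibility for more variety.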

In [29]:
# Generates a single name using the trained LSTM model.
def generate_name(model: Model, char_to_index: dict, index_to_char: dict, max_length: int, ngram: int, k: int = 5) -> str:
    # Start with the start token and a random letter
    name = [0] + [np.random.randint(3, len(char_to_index))] # Exclude start token and markers tokens "0" and "1"
    while len(name) < max_length:
        seq_in = pad_sequences([name[-ngram:]], maxlen=ngram, padding='pre')  # Generate based on the last n tokens
        prediction = model.predict(seq_in, verbose=0)
        next_index = choose_top_k_letters_from_model(prediction[0], k)
        next_letter = index_to_char[next_index]
        if next_letter == '1':
            break
        name.append(next_index)
    return ''.join([index_to_char[idx] for idx in name[1:]])

In [30]:
# Generates multiple names using the trained LSTM model.
def generate_names(model: Model, char_to_index: dict, index_to_char: dict, max_length: int, ngram: int, n: int, k: int = 5) -> list[str]:
    names = []
    for _ in range(n):
        name = generate_name(model, char_to_index=char_to_index, index_to_char=index_to_char, max_length=max_length, ngram=ngram, k=k)
        names.append(name)
    return names

## Generation of New Names

Let's create several *n-gram* based LSTM models, varying the `ngram` parameter for each one. The main differences between the models are therefore their vocabulary and the length of their input window.
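
To get a feel for how the vocabulary changes with `ngram`, the sketch below collects the single characters and then every n-gram of a toy list of names, in the spirit of `list_all_tokens`. The helper `distinct_tokens` and the toy names are hypothetical; a `dict` is used here only to deduplicate while preserving insertion order.

```python
def distinct_tokens(names: list[str], n: int) -> list[str]:
    """Collect single characters first, then every n-gram, without duplicates."""
    tokens: dict[str, None] = {}  # insertion-ordered set substitute
    for name in names:
        for char in name:
            tokens.setdefault(char, None)
    for name in names:
        for i in range(len(name) - n + 1):
            tokens.setdefault(name[i:i + n], None)
    return list(tokens)

toy_names = ["rex", "raptor"]
for n in (1, 2, 3):
    print(n, len(distinct_tokens(toy_names, n)))
# 1 7
# 2 14
# 3 12
```

On the real dataset the counts will differ, but the same call pattern shows how many output classes each model has to predict.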

### Examples 1-2
> ngram = 2

In [31]:
n = 2
data_X, data_y, vocab, char_to_index, index_to_char, vocab_size = prepare_n_training_data(data["dino_name"], ngram=n, max_length=max_word_length)

data_X shape: (36581, 2), data_y shape: (36581,)


In [32]:
lstm_model_2 = create_LSTM_model(vocab_size, sequence_length=n, show_summary=True)
lstm_model_2 = training(lstm_model_2, data_X, data_y, epochs=50, batch_size=32)

Epoch 1/5
915/915 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.4461 - loss: 2.6560 - val_accuracy: 0.5591 - val_loss: 1.4846
Epoch 2/5
915/915 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.5882 - loss: 1.4409 - val_accuracy: 0.6104 - val_loss: 1.3594
Epoch 3/5
915/915 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.6194 - loss: 1.3266 - val_accuracy: 0.6233 - val_loss: 1.2823
Epoch 4/5
915/915 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.6306 - loss: 1.2735 - val_accuracy: 0.6344 - val_loss: 1.2376
Epoch 5/5
915/915 ━━━━━━━━━━━━━━━━━━━━ 4s 4ms/step - accuracy: 0.6309 - loss: 1.2477 - val_accuracy: 0.6477 - val_loss: 1.2070


In [33]:
names = generate_names(lstm_model_2, char_to_index=char_to_index, index_to_char=index_to_char, max_length=max_word_length, n=5, ngram=n, k=2)
names

['suatar', 'ocantanonstrapanos', 'onataraptoraton', 'amonsanon', 'janton']

In [34]:
names = generate_names(lstm_model_2, char_to_index=char_to_index, index_to_char=index_to_char, max_length=max_word_length, n=5, ngram=n, k=5)
names

['jaianalis',
 'ryonaltepstonstilelon',
 'leetophosistelisolinuasaog',
 'lntirastenodopthenachaltas',
 'awortiatelongurantirothons']

### Examples 3-4
> ngram = 3

In [35]:
n = 3
data_X, data_y, vocab, char_to_index, index_to_char, vocab_size = prepare_n_training_data(data["dino_name"], ngram=n, max_length=max_word_length)

data_X shape: (35061, 3), data_y shape: (35061,)


In [36]:
lstm_model_3 = create_LSTM_model(vocab_size, sequence_length=n)
lstm_model_3 = training(lstm_model_3, data_X, data_y, epochs=50, batch_size=32)

Epoch 1/100
877/877 ━━━━━━━━━━━━━━━━━━━━ 9s 8ms/step - accuracy: 0.3867 - loss: 3.0788 - val_accuracy: 0.5637 - val_loss: 1.5692
Epoch 2/100
877/877 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.5528 - loss: 1.5491 - val_accuracy: 0.5851 - val_loss: 1.4285
Epoch 3/100
877/877 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.5882 - loss: 1.4267 - val_accuracy: 0.6077 - val_loss: 1.3255
Epoch 4/100
877/877 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.6084 - loss: 1.3222 - val_accuracy: 0.6150 - val_loss: 1.2914
Epoch 5/100
877/877 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.6209 - loss: 1.2788 - val_accuracy: 0.6308 - val_loss: 1.2524
Epoch 6/100
877/877 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - accuracy: 0.6286 - loss: 1.2529 - val_accuracy: 0.6355 - val_loss: 1.2299
Epoch 7/100
877/87

KeyboardInterrupt: 

In [60]:
names = generate_names(lstm_model_3, char_to_index=char_to_index, index_to_char=index_to_char, max_length=max_word_length, n=5, ngram=n, k=3)
names

['ustyctonychangjospedmeryp',
 'uamon',
 'onglospeng',
 'onyclanathangarinodassusi',
 'racericetegnestrintospeno']

In [76]:
names = generate_names(lstm_model_3, char_to_index=char_to_index, index_to_char=index_to_char, max_length=max_word_length, n=5, ngram=n, k=5)
names

['unategraspidilosplonycost',
 'alangleirururson',
 'ulodeiamperyaspleosuteste',
 'shustassibonistonisarssib',
 'ropsodreadociantodranosar']

### Examples 5-6
> ngram = 4

In [None]:
n = 4
data_X, data_y, vocab, char_to_index, index_to_char, vocab_size = prepare_n_training_data(data["dino_name"], ngram=n, max_length=max_word_length)

In [None]:
lstm_model_4 = create_LSTM_model(vocab_size, sequence_length=n)
lstm_model_4 = training(lstm_model_4, data_X, data_y, epochs=50, batch_size=32)

In [None]:
names = generate_names(lstm_model_4, char_to_index=char_to_index, index_to_char=index_to_char, max_length=max_word_length, n=5, ngram=n, k=3)
names

In [None]:
names = generate_names(lstm_model_4, char_to_index=char_to_index, index_to_char=index_to_char, max_length=max_word_length, n=5, ngram=n, k=5)
names