# Name Generator for Dinosaurs

### @Author : HADDOU Amine

The goal of this project is to develop different models that generate new dinosaur names.<br>
The objective is to develop the following two models:
- an n-gram language model
- a pre-trained model (from Hugging Face) fine-tuned for this project's goal.

# Modules Import

In [209]:
import pandas as pd
import numpy as np

from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Discovering data

### Data Import

Importing data from `data/dinos.txt`

In [210]:
data = pd.read_csv("../data/dinos.txt", names=["dino_name"])

In [211]:
data.head()

Unnamed: 0,dino_name
0,Aachenosaurus
1,Aardonyx
2,Abdallahsaurus
3,Abelisaurus
4,Abrictosaurus


### Data Exploration

Here is an overview of the imported data.

In [212]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1536 entries, 0 to 1535
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   dino_name  1536 non-null   object
dtypes: object(1)
memory usage: 12.1+ KB


Exploring 20 random rows.

In [213]:
data.sample(20)

Unnamed: 0,dino_name
654,Indosuchus
742,Lametasaurus
1146,Rhabdodon
1357,Texacephale
1235,Siamodon
38,Alamosaurus
3,Abelisaurus
76,Anabisetia
57,Alocodon
1396,Triceratops


Is there any punctuation (compound names, etc.)?

In [214]:
def print_non_alphabetical_chars(word):
    non_alphabetical = [char for char in word if char.lower() not in "abcdefghijklmnopqrstuvwxyz"]
    if non_alphabetical:
        print("Non-alphabetical characters in '{}': {}".format(word, ''.join(non_alphabetical)))
        return True
    return False

def print_non_alphabetical(data):
    presence = False # flag to check if there are any non-alphabetical characters
    for name in data["dino_name"]:
        if print_non_alphabetical_chars(name):
            presence = True
    if not presence:
        print("No non-alphabetical characters found in the dataset")

print_non_alphabetical(data)


No non-alphabetical characters found in the dataset
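
The same check can also be sketched with a regular expression that returns the offending names instead of printing them, which is easier to reuse later. This is only a minimal sketch: the helper name `find_non_alphabetical` and the sample list are hypothetical and not part of the notebook.

```python
import re

# A name is considered "clean" when it is made only of ASCII letters.
ONLY_LETTERS = re.compile(r"[A-Za-z]+")

def find_non_alphabetical(names):
    """Return a dict mapping each offending name to its non-letter characters."""
    offenders = {}
    for name in names:
        if not ONLY_LETTERS.fullmatch(name):
            # Keep only the characters that are not ASCII letters.
            offenders[name] = "".join(
                char for char in name if not ONLY_LETTERS.fullmatch(char)
            )
    return offenders

# Hypothetical sample, not taken from data/dinos.txt
sample = ["Aardonyx", "Tyrannosaurus rex", "T-rex"]
offenders = find_non_alphabetical(sample)
print(offenders)
```

An empty dictionary means the dataset contains only letters, which is what the check above reports for `dinos.txt`.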


Let's explore the length of the names.

In [215]:
# average length of dino names
avg_word_length = data["dino_name"].str.len().mean()

In [216]:
# max length of dino names
max_word_length = data["dino_name"].str.len().max()

In [217]:
# min length of dino names
min_word_length = data["dino_name"].str.len().min()

In [218]:
# quantile of dino names
quantile = data["dino_name"].str.len().quantile([0.25, 0.5, 0.75])

In [219]:
# number of unique dino names
unique_dino_names = data["dino_name"].nunique()

In [220]:
print(f"Average word length: {avg_word_length}")
print(f"Max word length: {max_word_length}")
print(f"Min word length: {min_word_length}")
print(f"Quantiles : {{25% : {quantile[0.25]}, 50% : {quantile[0.5]}, 75% : {quantile[0.75]}}}")

Average word length: 11.962239583333334
Max word length: 26
Min word length: 3
Quantiles : {25% : 10.0, 50% : 12.0, 75% : 13.0}
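
As a side note, the statistics above can probably be obtained in a single call with `Series.describe()`. The sketch below runs on the five names shown by `data.head()`, not on the full dataset, so its numbers differ from the ones printed above.

```python
import pandas as pd

# First five names of the dataset, as shown by data.head()
sample = pd.Series(
    ["Aachenosaurus", "Aardonyx", "Abdallahsaurus", "Abelisaurus", "Abrictosaurus"]
)

# Length of each name, then count / mean / std / min / quartiles / max at once
lengths = sample.str.len()
stats = lengths.describe()
print(stats)
```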


These results are interesting. Dinosaur names are, on average, quite long (about 12 characters). So, to generate a name, the model has to take into account enough context from the previous letters to generate the remaining part.

My first idea for the __n-gram model__ is to select `n_gram=3`, because I want to capture the syllables in a name. In my opinion, the last syllable is enough context. It corresponds to the last 3 characters before the character we want to predict.<br>
In this context, a *syllable* is defined as a sequence of three characters in which, usually, at least the middle one is a vowel.<br>
Trigrams (3-grams) seem like a reasonable choice, since many names have lengths that are multiples of 3.

# Data Processing

## Lower Case Names

We start by lowercasing all letters.

In [221]:
data["dino_name"] = data["dino_name"].str.lower()
data.head(2)

Unnamed: 0,dino_name
0,aachenosaurus
1,aardonyx


## Padding - Not needed for now

Let's add padding at the end of words to make all names the same length. We will first try the `n-gram` model without padding.<br>
Maybe with padding, the model will be able to learn __how__ dino names usually end ("aurus", "tor", etc.).

In [222]:
def add_padding(names: list[str], max_length: int) -> list[str]:
    padded_names = []
    for name in names :
        if len(name) < max_length:
            padded_names.append(name + "0" * (max_length - len(name)))
        else:
            padded_names.append(name)
    return padded_names

In [223]:
data["paddded_dino_name"] = add_padding(data["dino_name"].values, max_word_length)

In [224]:
data.head(2)

Unnamed: 0,dino_name,paddded_dino_name
0,aachenosaurus,aachenosaurus0000000000000
1,aardonyx,aardonyx000000000000000000


# Start and End Token

Let's add `0` at the start and `1` at the end of each name. The `0` is useful at the beginning of generation to produce the first letter. For a `k-gram` model, we prepend `k` `0` characters so that the first letter can be generated.

The `1` helps us determine when generation is finished. It may also help the model *learn* how dino names end ("tor", "saurus", etc.).

In [225]:
def add_start_end_tokens(names: list[str], n: int) -> list[str]:
    """Adds start '0' and end '1' tokens to each name."""
    return ["0" * n + name + "1" for name in names]
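
A quick usage example makes the effect of these tokens concrete. The function is repeated here only so that the snippet runs on its own, and the name `"rex"` is just an illustrative input.

```python
def add_start_end_tokens(names: list[str], n: int) -> list[str]:
    """Adds n start tokens '0' and one end token '1' to each name."""
    return ["0" * n + name + "1" for name in names]

# With n=3, each name is prefixed by "000" and suffixed by "1"
tokenized = add_start_end_tokens(["rex", "aardonyx"], 3)
print(tokenized)
```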

# N-gram model

Here is a first function that generates a list of n-grams from a list of names at the **character** level.

When building the vocabulary and tokens, we add an `<UNK>` token that is used for any unseen token or character during generation. This is important because if the model produces a token that was not in the initial dataset, it would otherwise be unable to continue the generation.

In [227]:
def list_all_tokens(names: list[str], n: int) -> list[str]:
    """Generates and returns all unique n-grams from the names."""
    tokens = ["<UNK>"]  # Add the <UNK> token at the beginning
    for name in names:
        for i in range(len(name) - n + 1):
            token = name[i:i+n]
            if token not in tokens:
                tokens.append(token)
    return tokens

In [228]:
print(list_all_tokens(["dino", "dinosour", "dinosourus"], 4))

['<UNK>', 'dino', 'inos', 'noso', 'osou', 'sour', 'ouru', 'urus']
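
One possible refinement, shown only as a sketch: `token not in tokens` scans the whole list each time, which gets slow as the number of tokens grows. A dictionary keeps insertion order and gives constant-time membership checks. The name `list_all_tokens_fast` is hypothetical.

```python
def list_all_tokens_fast(names, n):
    """Return all unique character n-grams, in order of first appearance."""
    # dict preserves insertion order, so <UNK> stays at index 0
    seen = {"<UNK>": None}
    for name in names:
        for i in range(len(name) - n + 1):
            seen.setdefault(name[i:i + n], None)
    return list(seen)

fast_tokens = list_all_tokens_fast(["dino", "dinosour", "dinosourus"], 4)
print(fast_tokens)
```

On this input it returns the same list as `list_all_tokens` above.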


After determining the tokens of the language, we also need to determine our vocabulary. In our case, it is the set of all characters.

In [229]:
def build_vocabulary(names: list[str]) -> list[str]:
    """Builds and returns a list of all unique characters (vocabulary) from the names."""
    vocab = ["<UNK>"]  # Add the <UNK> token at the beginning
    for name in names:
        for letter in name:
            if letter not in vocab:
                vocab.append(letter)
    return vocab

Now, let's compute the probability of each letter appearing after a token.

In [230]:
def print_probabilities(probabilities: np.ndarray, tokens: list[str], vocab: list[str]) -> None:
    """Prints the probabilities of generating each letter for each token."""
    for i, token in enumerate(tokens):
        print(f"Token: {token}")
        for j, letter in enumerate(vocab):
            prob = probabilities[i][j]
            if prob > 0:
                print(f"    {letter}: {prob:.3f}")

In [231]:
def compute_probabilities(names: list[str], tokens: list[str], vocab: list[str], n: int) -> np.ndarray:
    """Computes the conditional probabilities of each token generating a letter from the vocabulary."""
    probabilities_array = np.zeros((len(tokens), len(vocab)))

    # Count occurrences of letters following each token
    for name in names:
        for i in range(len(name) - n):
            token = name[i:i+n]
            next_letter = name[i+n]  # The letter after the token
            if next_letter in vocab:
                token_index = tokens.index(token) if token in tokens else tokens.index("<UNK>")
                letter_index = vocab.index(next_letter)
                probabilities_array[token_index][letter_index] += 1
            else:
                print(f"Letter {next_letter} not in vocabulary")
                probabilities_array[tokens.index("<UNK>")][vocab.index("<UNK>")] += 1

    # Normalize the probabilities by dividing by row sums
    row_sums = probabilities_array.sum(axis=1, keepdims=True)
    probabilities_array = np.divide(probabilities_array, row_sums, where=row_sums != 0, out=probabilities_array)

    # print_probabilities(probabilities_array, tokens, vocab)
    
    return probabilities_array
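
To see what the counting and normalization steps produce, here is a small worked example on two toy names. It is a sketch that uses dictionaries instead of the NumPy array above, and the helper names `count_transitions` and `to_probabilities` are hypothetical.

```python
from collections import Counter, defaultdict

def count_transitions(names, n):
    """Count how often each letter follows each context of n characters."""
    counts = defaultdict(Counter)
    for name in names:
        for i in range(len(name) - n):
            counts[name[i:i + n]][name[i + n]] += 1
    return counts

def to_probabilities(counts):
    """Normalize each context's counts so that they sum to 1."""
    probs = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        probs[context] = {letter: c / total for letter, c in counter.items()}
    return probs

# Two toy names, already wrapped with start ("00") and end ("1") tokens
toy_names = ["00ab1", "00ac1"]
probs = to_probabilities(count_transitions(toy_names, 2))
print(probs)
```

Here the context `"0a"` is followed once by `b` and once by `c`, so each gets a probability of 0.5, while `"00"` is always followed by `a`.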

To add some randomness and avoid always generating the same name, we will not take the most probable letter but pick one of the top k letters.

In [232]:
def choose_top_k_letters(probabilities: np.ndarray, token_index: int, vocab: list[str], k: int = 5, verbose: bool = False) -> str:
    """Chooses one of the top k letters based on probabilities. Only selects letters with probability > 0."""
    
    # Sort the probabilities and get indices sorted by highest probability
    sorted_indices = np.argsort(probabilities[token_index])[::-1]  # Sort descending
    
    # Filter out indices where the probability is > 0
    valid_indices = [idx for idx in sorted_indices if probabilities[token_index][idx] > 0]

    # Adjust k if there are fewer than k valid options
    k = min(k, len(valid_indices))
    
    if k == 0:
        # Handle the case where no valid options with prob > 0 exist, fallback to <UNK> or any default behavior
        if verbose:
            print("No valid letters with probability > 0. Falling back to <UNK>.")
        return "<UNK>"  # or any other fallback behavior you'd like

    # Select the top k valid indices
    best_k_indices = valid_indices[:k]
    
    # Log top k predicted letters
    if verbose:
        print(f"\nTop {k} predicted letters for token index {token_index}:")
        for idx in best_k_indices:
            print(f"Letter: {vocab[idx]}, Probability: {probabilities[token_index][idx]}")

    # Choose one of the top k at random (or return the best if k=1)
    if k > 1:
        chosen_index = np.random.choice(best_k_indices)
    else:
        chosen_index = best_k_indices[-1]
    
    chosen_letter = vocab[chosen_index]
    if verbose:
        print(f"Chosen letter: {chosen_letter}")
    
    return chosen_letter
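
Note that the function above picks uniformly among the top `k` letters, so the fifth-best letter is as likely as the best one. A possible variation, which is not what this notebook uses, is to sample among the top `k` in proportion to their probabilities. The sketch below assumes a single probability row as input, and the name `sample_top_k_weighted` is hypothetical.

```python
import numpy as np

def sample_top_k_weighted(row, vocab, k=5, rng=None):
    """Sample a letter among the top k, weighted by its probability."""
    rng = np.random.default_rng() if rng is None else rng
    row = np.asarray(row, dtype=float)
    # Indices of the k highest probabilities, best first
    top = np.argsort(row)[::-1][:k]
    # Drop letters that were never observed after this token
    top = top[row[top] > 0]
    if top.size == 0:
        return "<UNK>"
    # Renormalize so that the kept probabilities sum to 1
    weights = row[top] / row[top].sum()
    return vocab[int(rng.choice(top, p=weights))]

toy_vocab = ["<UNK>", "a", "b", "c"]
toy_row = [0.0, 0.7, 0.3, 0.0]
print(sample_top_k_weighted(toy_row, toy_vocab, k=2, rng=np.random.default_rng(0)))
```

With this variation a larger `k` still adds variety, but unlikely letters are chosen less often than likely ones.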

### Finally, we have all the tools to create the model.

In [233]:
def ngram_predict_new_name(tokens: list[str], vocab: list[str], probabilities: np.ndarray, n: int, max_inter_count : int = 10, k: int = 5, verbose: bool = False) -> str:
    """Generates a new name using the n-gram model with <UNK> token handling."""
    generated_name = ""  # Start with an empty string
    token = "0" * n  # Initial token is the start token

    if verbose:
        print(f"\nStarting name generation:")
    
    inter_count = 0
    while inter_count < max_inter_count:
        # Handle unknown token by using <UNK> token
        token_index = tokens.index(token) if token in tokens else tokens.index("<UNK>")
        
        if verbose:
            print(f"\nCurrent token: {token}")
            print(f"Name under construction: {generated_name}")
        
        if token == "0" * n:  # Start token: choose any letter
            next_letter = choose_top_k_letters(probabilities, token_index, vocab, k=len(vocab)-1, verbose=verbose)
        else:
            # Find the next letter using one of the top k probabilities
            next_letter = choose_top_k_letters(probabilities, token_index, vocab, k=k, verbose=verbose)
        
        if next_letter in ("1", "<UNK>"):  # End token '1', or no known continuation for this token
            if verbose:
                print(f"End of name reached with letter: {next_letter}\n")
            break
        
        generated_name += next_letter
        token = (token + next_letter)[-n:]  # Shift the token by one letter, keeping it at length n
        inter_count += 1
    
    if verbose:
        print(f"Final generated name: {generated_name}\n")
    return generated_name

In [243]:
def ngram_model(names: list[str], n: int = 4, num_predictions: int = 5, max_length_output : int = 10, k: int = 5, verbose: bool = False) -> list[str]:
    """Trains an n-gram model and generates new names."""
    # Add start/end tokens to names and build model components
    n = n - 1 # n-gram model uses n-1 letters to predict the new one. So I'll adjust n here. 
    names = add_start_end_tokens(names, n)
    tokens = list_all_tokens(names, n)
    vocab = build_vocabulary(names)
    probabilities = compute_probabilities(names, tokens, vocab, n)
    
    # Generate new names based on the model
    generated_names = []
    for _ in range(num_predictions):
        name = ngram_predict_new_name(tokens, vocab, probabilities, n, max_inter_count=max_length_output, k=k, verbose=verbose)
        generated_names.append(name)
    
    return generated_names

In [244]:
# Generate 5 names with n=4, top 5 letters considered, and verbose logging enabled
names = data["dino_name"].values
generated_names = ngram_model(names, n=4, max_length_output=max_word_length, num_predictions=5, k=5, verbose=True)

print(generated_names)


Starting name generation:

Current token: 000
Name under construction: 

Top 26 predicted letters for token index 1:
Letter: a, Probability: 0.10807291666666667
Letter: s, Probability: 0.09505208333333333
Letter: p, Probability: 0.08138020833333333
Letter: c, Probability: 0.07096354166666667
Letter: t, Probability: 0.064453125
Letter: m, Probability: 0.059244791666666664
Letter: l, Probability: 0.053385416666666664
Letter: d, Probability: 0.052734375
Letter: b, Probability: 0.048828125
Letter: e, Probability: 0.042317708333333336
Letter: h, Probability: 0.041666666666666664
Letter: g, Probability: 0.039713541666666664
Letter: n, Probability: 0.03125
Letter: o, Probability: 0.026692708333333332
Letter: r, Probability: 0.026041666666666668
Letter: k, Probability: 0.026041666666666668
Letter: j, Probability: 0.016927083333333332
Letter: z, Probability: 0.016927083333333332
Letter: i, Probability: 0.015625
Letter: y, Probability: 0.015625
Letter: v, Probability: 0.013671875
Letter: f, Pro

Results of my generator with different parameter values:
-  `n = 3` and `k = 2`<br>
['neoversosuccingshadros', 'bator', 'quilmayisaurutichodosuccin', 'dasylossus', 'inosphagros']
-  `n = 3` and `k = 3`<br>
['yungonius', 'elaplossuesiohadromaia', 'venescelusothostospinax', 'xingsauros', 'euskelyx']
-  `n = 2` and `k = 3`<br>
['heisaudasilis', 'kan', 'ale', 'xiasaustesaudiangobistrops', 'raptastriong']
- `n = 4` and `k=3`<br>
['unicerosaurophale', 'yurgovuchia', 'ischyrophus', 'ovirapterovenatosaurus', 'magnapartenykus']
- `n = 4` and `k=5`<br>
['cristatus', 'vouivria', 'yongjianosaurutitanius', 'mojoceratusauravusaurornit', 'epachthosuchomimoides']

It looks like with a 4-gram model, the generated names resemble real dino names. Even with high randomness (k=5), the results are still good.

Even with the 3-gram model, the results are good (with low randomness).

## N-gram Model Summary

An **n-gram model** is a type of probabilistic model used for generating sequences (in this case, names) by predicting the next element in a sequence based on the previous **n-1** elements. The model learns, from a training dataset, the probability distribution of the letters that follow each context (a substring made of the **n-1** preceding characters).

This implementation generates new names one letter at a time using the learned probability distribution from the training data. The generated name is built step by step by predicting the next letter based on the preceding **n-1** letters.

---

### Key Components and Parameters

#### 1. **`n` (N-gram size)**
- Defines the length of the n-grams. 
- **Example**: If `n=3`, the model will use 2 previous letters to predict the next one (trigram).

#### 2. **`list_all_tokens(names, n)`**
- **Description**: Generates all unique n-grams from the names and adds a special `<UNK>` token for unknown or unseen tokens.

#### 3. **`build_vocabulary(names)`**
- **Description**: Builds a vocabulary of all unique letters (characters) in the training data, including the `<UNK>` token.

#### 4. **`compute_probabilities(names, tokens, vocab, n)`**
- **Description**: Computes the conditional probabilities of generating a letter from the vocabulary based on the preceding n-gram (token). These probabilities are used during name generation.
- **Impact**: Determines how likely certain letters are to follow particular n-grams. This affects the diversity and realism of generated names.

#### 5. **`choose_top_k_letters(probabilities, token_index, vocab, k)`**
- **Description**: Selects the next letter from the top `k` most probable letters based on the learned probabilities.
- **Parameters**:
  - **`k`**: The number of top letters to consider when selecting the next letter. If `k > 1`, one letter is chosen randomly from the top `k`.
  - When selecting the top `k`, any letter with a probability of 0 is __excluded__, and `k` is reduced accordingly for that prediction.
  - **Impact**: 
    - Higher `k` introduces more randomness (more variation).
    - Lower `k` makes the model pick the highest probability letter, leading to more deterministic behavior.

#### 6. **`ngram_predict_new_name(tokens, vocab, probabilities, n, max_inter_count, k)`**
- **Description**: Generates a new name by predicting one letter at a time using the n-gram model.
- **Parameters**:
  - **`max_inter_count`**: The maximum length of the generated name (or maximum number of prediction steps).
  - **`k`**: How many of the top predicted letters are considered for the next letter.

#### 7. **`ngram_model(names, n, num_predictions, max_length_output, k, verbose)`**
- **Description**: Trains the n-gram model and generates `num_predictions` new names.
- **Parameters**:
  - **`num_predictions`**: Number of new names to generate.
  - **`max_length_output`**: Maximum number of characters in each generated name.
  - **`verbose`**: Whether to print detailed logs of the name generation process.

---

### How the Model Works:
1. **Training Phase**:
   - The model processes the list of names to build the vocabulary and n-gram tokens.
   - It calculates the probability of each letter occurring after each token based on the training data.
   
2. **Generation Phase**:
   - The model generates new names by starting with a special start token (e.g., `"000"` for `n=4`).
   - For each step, it uses the previous n-1 letters to predict the next letter.
   - The next letter is chosen based on the learned probabilities, and the process continues until an end token (`"1"`) is generated or the name reaches the maximum length (`max_length_output`).
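
The two phases above can be condensed into a short, self-contained sketch. It is a simplified illustration rather than the notebook's implementation: it stores counts in dictionaries, applies the same `k` to the first letter, and the function names `train` and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def train(names, n):
    """Training phase: count which letter follows each (n-1)-character context."""
    if n < 2:
        raise ValueError("n must be at least 2")
    context_len = n - 1
    counts = defaultdict(Counter)
    for name in names:
        padded = "0" * context_len + name + "1"  # start and end tokens
        for i in range(len(padded) - context_len):
            counts[padded[i:i + context_len]][padded[i + context_len]] += 1
    return counts

def generate(counts, n, k=3, max_length=10, rng=random):
    """Generation phase: extend the name one letter at a time."""
    context_len = n - 1
    context = "0" * context_len
    name = ""
    while len(name) < max_length:
        candidates = counts.get(context)
        if not candidates:  # unseen context: stop instead of guessing
            break
        top_k = [letter for letter, _ in candidates.most_common(k)]
        next_letter = rng.choice(top_k)
        if next_letter == "1":  # end token
            break
        name += next_letter
        context = (context + next_letter)[-context_len:]
    return name

toy_counts = train(["rex"], n=3)
print(generate(toy_counts, n=3, k=1))
```

With a single training name and `k=1`, every context has exactly one continuation, so the sketch simply reproduces `rex`.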

---

### Example Usage:

```python
# Example test data (dinosaur names)
dino_names = ["Tyrannosaurus", "Stegosaurus", "Triceratops"]

# Generate 5 names with n=4, maximum name length of 10, and top 3 letters considered
generated_names = ngram_model(dino_names, n=4, num_predictions=5, max_length_output=10, k=3, verbose=True)

print(generated_names)
```

This example generates 5 names with a 4-gram model (each letter is predicted from the 3 preceding ones), considers the top 3 letters at each step, and limits the names to 10 characters max.