# Namesformer


A Streamlit application with the trained models is available here: [Namesformer Streamlit app](https://namesformer-project-n75w6hw3x87mgi4mjbfaat.streamlit.app/). The [Git repository](https://github.com/MariusGvergzdys/Namesformer-project/tree/main) contains this notebook along with additional files. To build a Lithuanian name generator, we first import the necessary modules.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence

Next, we scrape all Lithuanian male and female names and replace accented characters with simpler forms to make model training easier. Running this cell saves two text files, one with male names and one with female names (both with the accented characters normalized).

In [8]:
import requests
from bs4 import BeautifulSoup

names_v = []
names_m = []

for key in ['a', 'b', 'c', 'c-2', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
            'm', 'n', 'o', 'p', 'r', 's', 's-2', 't', 'u', 'v', 'z', 'z-2']:
    url = f'https://vardai.vlkk.lt/sarasas/{key}/'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links1 = soup.find_all('a', class_='names_list__links names_list__links--man')
    links2 = soup.find_all('a', class_='names_list__links names_list__links--woman')
    names_v += [name.text for name in links1]
    names_m += [name.text for name in links2]

# Mapping dictionary to convert special characters to simpler forms
char_map = {
    'ò': 'o', 'ỹ': 'y', 'È': 'E', 'ù': 'u', 'ĩ': 'i',
    'ý': 'y', 'è': 'e', '̀': '', '́': '', '̃': '', 'é': 'e',
    'Ù': 'U', 'Ẽ': 'E', 'ò': 'o', 'ì': 'i', 'õ': 'o', 'á': 'a',
    'Õ': 'O', 'ẽ': 'e', 'ã': 'a', 'Ã': 'A', 'ũ': 'u', 'ó': 'o', 'ñ': 'n',
    'Ì': 'I', 'à': 'a', 'Ò': 'O', 'ỹ': 'y', 'Á': 'A'
}

# Function to replace special characters using the mapping
def normalize_name(name, char_map):
    return ''.join(char_map.get(char, char) for char in name)

# Apply the normalization to the list of names
normalized_names_v = [normalize_name(name, char_map) for name in names_v]
normalized_names_m = [normalize_name(name, char_map) for name in names_m]

np.savetxt('vardai_v.txt', normalized_names_v, fmt='%s', header='name', comments='', newline='\n')
np.savetxt('vardai_m.txt', normalized_names_m, fmt='%s', header='name', comments='', newline='\n')
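
The explicit `char_map` above can also be expressed through Unicode decomposition. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it assumes that the marks to drop are exactly the combining grave, acute and tilde accents (the ones that occur in `char_map`), and `strip_stress_marks` is a hypothetical helper name. Letters such as `ė`, `š` or `ū` rely on other combining marks and are left intact.

```python
import unicodedata

# Combining grave (U+0300), acute (U+0301) and tilde (U+0303).
# Assumption: these are the only marks that need to be removed.
STRESS_MARKS = {'\u0300', '\u0301', '\u0303'}

def strip_stress_marks(name):
    """Remove stress accents while keeping base letters and other diacritics."""
    decomposed = unicodedata.normalize('NFD', name)   # split letters from marks
    kept = ''.join(ch for ch in decomposed if ch not in STRESS_MARKS)
    return unicodedata.normalize('NFC', kept)         # recompose what is left

print(strip_stress_marks('\u00c3bas'))                 # A with tilde + "bas"
print(strip_stress_marks('\u017d\u1ef9gimantas'))      # Z with caron, y with tilde
```

Compared with a hand-written dictionary, this form does not need a new entry for every accented letter, but it removes a mark wherever it occurs, so the set of marks should be checked against the data before relying on it.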

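We append a space to each name to mark its end. Because the space sorts before every letter, it receives index 0 in the vocabulary, which is also the value used for padding later on. The same procedure applies to both male and female names, since two separate datasets are built.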

In [9]:
class NameDataset(Dataset):
    def __init__(self, csv_file):
        self.names = pd.read_csv(csv_file)['name'].values
        self.chars = sorted(list(set(''.join(self.names) + ' ')))  # Including a padding character
        self.char_to_int = {c: i for i, c in enumerate(self.chars)}
        self.int_to_char = {i: c for c, i in self.char_to_int.items()}
        self.vocab_size = len(self.chars)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx] + ' '  # Append the end-of-name marker (also used as padding)
        encoded_name = [self.char_to_int[char] for char in name]
        return torch.tensor(encoded_name)


In [10]:
dataset_v = NameDataset('vardai_v.txt')
dataset_m = NameDataset('vardai_m.txt')
print('Size of male dataset:', len(names_v))
print('Size of female dataset:', len(names_m))

Size of male dataset: 3850
Size of female dataset: 4235


Encoded name for the first male name.

In [11]:
dataset_v[0]

tensor([ 1, 24, 23, 40,  0])

Decoded name for the first male name.

In [12]:
[dataset_v.int_to_char[int(char)] for char in dataset_v[0]]

['A', 'b', 'a', 's', ' ']
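
The round trip shown above can be reproduced without the scraped files. The following is a minimal sketch on a made-up list of three names (`toy_names`, `encode` and `decode` are illustrative and not part of the notebook). It builds the vocabulary the same way as `NameDataset` and shows why the end marker gets index 0: the space sorts before every letter.

```python
toy_names = ['Abas', 'Rimas', 'Ona']  # hypothetical sample, not the real dataset

# Same construction as in NameDataset: unique characters plus the space marker
chars = sorted(set(''.join(toy_names) + ' '))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for c, i in char_to_int.items()}

def encode(name):
    """Append the end marker and map each character to its index."""
    return [char_to_int[c] for c in name + ' ']

def decode(indices):
    """Map indices back to characters."""
    return ''.join(int_to_char[i] for i in indices)

encoded = encode('Abas')
print(encoded)
print(repr(decode(encoded)))
```

The indices differ from the ones in the real dataset because the toy vocabulary is much smaller; only the construction is the same.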

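Next, we define a collate function that pads every name in a batch to a common length and splits the result into an input sequence and a target sequence shifted by one character.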

In [13]:
# Custom collate function for padding
def pad_collate(batch):
    padded_seqs = pad_sequence(batch, batch_first=True, padding_value=0)
    input_seq = padded_seqs[:, :-1]
    target_seq = padded_seqs[:, 1:]
    return input_seq, target_seq

dataloader_v = DataLoader(dataset_v, batch_size=32, shuffle=True, collate_fn=pad_collate)
dataloader_m = DataLoader(dataset_m, batch_size=32, shuffle=True, collate_fn=pad_collate)

In [63]:
next(iter(dataloader_v))

(tensor([[ 1, 39, 41, 30, 42, 39,  0,  0,  0,  0,  0,  0],
         [11, 23, 81, 39, 38, 23, 40,  0,  0,  0,  0,  0],
         [ 5, 36, 82, 39, 31, 33, 23, 40,  0,  0,  0,  0],
         [13, 31, 80, 34, 26, 23, 39, 23, 40,  0,  0,  0],
         [ 1, 36, 41, 37, 80, 36, 23, 40,  0,  0,  0,  0],
         [21, 31, 80, 36, 29, 31, 36, 41, 23, 40,  0,  0],
         [ 5, 35, 23, 36, 42, 27, 80, 34, 31, 40,  0,  0],
         [ 7, 27, 26, 23, 35, 31, 36, 23, 40,  0,  0,  0],
         [10, 37, 81, 43, 31, 36, 23, 40,  0,  0,  0,  0],
         [13, 31, 33, 37, 34, 37, 32, 42, 40,  0,  0,  0],
         [16, 27, 39, 82, 34, 31, 40,  0,  0,  0,  0,  0],
         [16, 23, 34, 27, 35, 37, 80, 36, 23, 40,  0,  0],
         [ 2, 27, 36,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 4, 23, 36, 31, 27, 34, 40,  0,  0,  0,  0,  0],
         [ 1, 39, 40, 27, 36, 31, 32, 40,  0,  0,  0,  0],
         [ 1, 34, 82, 35, 27, 26, 23, 40,  0,  0,  0,  0],
         [ 5, 31, 82, 35, 23, 40,  0,  0,  0,  0,  0,  0
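
To make the shift between inputs and targets easier to follow, here is a plain-Python restatement of what `pad_collate` computes. It is a sketch for illustration only: `pad_and_shift` is a hypothetical name, the two encoded names are made up, and lists stand in for tensors so that the example runs without PyTorch.

```python
def pad_and_shift(batch, padding_value=0):
    """Right-pad sequences to equal length, then split into input/target."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [padding_value] * (max_len - len(seq)) for seq in batch]
    inputs = [row[:-1] for row in padded]   # everything except the last position
    targets = [row[1:] for row in padded]   # the same rows shifted left by one
    return inputs, targets

# Hypothetical encoded names; 0 is both the end marker and the padding value
batch = [[1, 5, 4, 9, 0], [2, 8, 4, 0]]
inputs, targets = pad_and_shift(batch)
print(inputs)
print(targets)
```

At every position the target is the character that follows the input character. Since the padding value equals the end-marker index, padded positions also carry target 0, so a loss without `ignore_index` includes them as well.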

Now we define the model architecture and the training loop. Two models are built: one trained on the male dataset and the other on the female dataset.

In [14]:
class MinimalTransformer(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, forward_expansion):
        # forward_expansion is accepted but not used: the encoder layer keeps
        # PyTorch's default feed-forward width (dim_feedforward=2048).
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # Learned positional embeddings for sequences of up to 100 characters
        self.positional_encoding = nn.Parameter(torch.randn(1, 100, embed_size))
        # batch_first is left at its default value (False)
        self.encoder_layer = nn.TransformerEncoderLayer(d_model=embed_size, nhead=num_heads)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=1)
        self.output_layer = nn.Linear(embed_size, vocab_size)

    def forward(self, x):
        # x: (batch, seq_len) tensor of character indices
        x = self.embed(x) + self.positional_encoding[:, :x.size(1), :]
        x = self.transformer_encoder(x)  # no attention mask is passed
        x = self.output_layer(x)  # logits of shape (batch, seq_len, vocab_size)
        return x

# Training Loop
def train_model(model, dataloader, epochs=150):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters())

    for epoch in range(epochs):
        model.train()
        total_loss = 0.0
        batch_count = 0

        for batch_idx, (input_seq, target_seq) in enumerate(dataloader):
            optimizer.zero_grad()

            output = model(input_seq)
            loss = criterion(output.transpose(1, 2), target_seq)

            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            batch_count += 1

        average_loss = total_loss / batch_count
        print(f'Epoch {epoch+1}, Average Loss: {average_loss}')


model_v = MinimalTransformer(vocab_size=dataset_v.vocab_size, embed_size=128, num_heads=8, forward_expansion=4)
model_m = MinimalTransformer(vocab_size=dataset_m.vocab_size, embed_size=128, num_heads=8, forward_expansion=4)

print("Training of male model on male dataset:")
train_model(model_v, dataloader_v)
print("Training female model on female dataset:")
train_model(model_m, dataloader_m)



Training of male model on male dataset:
Epoch 1, Average Loss: 1.3734787200108047
Epoch 2, Average Loss: 1.180070554421953
Epoch 3, Average Loss: 1.1540861523841037
Epoch 4, Average Loss: 1.1403056851103286
Epoch 5, Average Loss: 1.144808336230349
Epoch 6, Average Loss: 1.134744022503372
Epoch 7, Average Loss: 1.1300937277226408
Epoch 8, Average Loss: 1.1303028370723252
Epoch 9, Average Loss: 1.1253380849341716
Epoch 10, Average Loss: 1.1246882578558173
Epoch 11, Average Loss: 1.125990235115871
Epoch 12, Average Loss: 1.1195545713763593
Epoch 13, Average Loss: 1.1168948737057773
Epoch 14, Average Loss: 1.123084168296215
Epoch 15, Average Loss: 1.1194742819494452
Epoch 16, Average Loss: 1.1131768344847623
Epoch 17, Average Loss: 1.1064387673188831
Epoch 18, Average Loss: 1.1134696982123635
Epoch 19, Average Loss: 1.105308040607074
Epoch 20, Average Loss: 1.1157481985643876
Epoch 21, Average Loss: 1.111461024639035
Epoch 22, Average Loss: 1.1101557341488926
Epoch 23, Average Loss: 1.1073
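
`nn.TransformerEncoderLayer` defaults to `batch_first=False`, and the `forward` method above passes no attention mask. The sketch below is a possible variant, not the model trained in this notebook and not evaluated here: it sets `batch_first=True` so that attention runs along the character axis, and it adds a causal mask so that position `t` cannot look at later characters. The class name `CausalTransformer`, the `max_len` argument and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CausalTransformer(nn.Module):
    """Hypothetical variant of MinimalTransformer with a causal attention mask."""

    def __init__(self, vocab_size, embed_size=32, num_heads=4, max_len=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.positional_encoding = nn.Parameter(torch.randn(1, max_len, embed_size))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_size, nhead=num_heads, batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.output_layer = nn.Linear(embed_size, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        # True above the diagonal marks positions that must NOT be attended to
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.embed(x) + self.positional_encoding[:, :seq_len, :]
        h = self.transformer_encoder(h, mask=mask)
        return self.output_layer(h)

torch.manual_seed(0)
model = CausalTransformer(vocab_size=10).eval()
a = torch.tensor([[1, 5, 4, 9]])
b = torch.tensor([[1, 5, 4, 2]])  # differs from `a` only in the last character
with torch.no_grad():
    out_a, out_b = model(a), model(b)

# With the mask, logits for the first three positions ignore the last character
print(torch.allclose(out_a[:, :3], out_b[:, :3], atol=1e-5))
```

The final check compares two inputs that differ only in their last character. If the mask works as intended, the logits at the earlier positions are the same for both.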

Lastly, we define a sampling function that generates a name one character at a time, starting from a given prefix and stopping at the end marker.

In [15]:
def sample(model, dataset, start_str='a', max_length=20, temperature=1.0):
    assert temperature > 0, "Temperature must be greater than 0"
    model.eval()
    with torch.no_grad():
        # Convert start string to tensor
        chars = [dataset.char_to_int[c] for c in start_str]
        input_seq = torch.tensor(chars).unsqueeze(0)  # Add batch dimension

        output_name = start_str
        for _ in range(max_length - len(start_str)):
            output = model(input_seq)

            # Apply temperature scaling
            logits = output[0, -1] / temperature
            probabilities = torch.softmax(logits, dim=0)

            # Sample a character from the probability distribution
            next_char_idx = torch.multinomial(probabilities, 1).item()
            next_char = dataset.int_to_char[next_char_idx]

            if next_char == ' ':
                break

            # Add next character to the generated name
            output_name += next_char

            # Update the input sequence for the next iteration
            input_seq = torch.cat([input_seq, torch.tensor([[next_char_idx]])], dim=1)

        return output_name


print('Sampling male names:')
for _ in range(10):
    generated_name = sample(model_v, dataset_v, start_str='R', temperature=0.7)
    print(' ', generated_name)

print('Sampling female names:')
for _ in range(10):
    generated_name = sample(model_m, dataset_m, start_str='R', temperature=0.7)
    print(' ', generated_name)


Sampling male names:
  Rarnantijus
  Rinantas
  Radmas
  Rygimas
  Ronis
  Raugis
  Ririaudas
  Ranas
  Romantas
  Rongis
Sampling female names:
  Rairycija
  Rarita
  Rarinė
  Rilitija
  Rinenana
  Raimintė
  Rara
  Raurga
  Ranija
  Rertė
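
To see what the `temperature` argument does, the sketch below applies the same scaling as `sample` to a made-up logit vector. The values are illustrative rather than model outputs, and `softmax_with_temperature` is a hypothetical helper written with the standard library only.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    assert temperature > 0, "Temperature must be greater than 0"
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring character, while higher temperatures flatten the distribution and make unusual characters more likely.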


In [17]:
torch.save(model_v, 'namesformer_model_v.pth')
torch.save(model_m, 'namesformer_model_m.pth')
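
`torch.save(model, ...)` pickles the whole object, so loading it later requires the `MinimalTransformer` class to be importable under the same name. A common alternative is to store only the `state_dict`. The sketch below shows that pattern on a small stand-in module (`nn.Linear`) with an in-memory buffer so that it runs on its own. For the models above, a file path would take the place of the buffer, and `MinimalTransformer(...)` would take the place of `nn.Linear(4, 2)`.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)  # stand-in for model_v / model_m

# Save only the parameters
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Rebuild the architecture first, then load the parameters into it
buffer.seek(0)
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer))
restored.eval()

x = torch.ones(1, 4)
print(torch.equal(model(x), restored(x)))
```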

# Conclusions

The idea was to build two datasets, one of male names and one of female names, and to train a separate model on each. That may seem inefficient, so a different approach was also tested: a single dataset containing both male and female names, with '%' marking the end of a male name and '#' marking the end of a female name. One model was then trained on the whole dataset, and the sample function was modified slightly by adding a 'sex' parameter that indicates which gender of names to generate.

However, when asked for 10 male names, this model produced on average 2-3 female names among them, and vice versa. The model appeared unable to separate the genders reliably. Increasing the number of training epochs or the number of layers did not give significantly better results. The single-model approach would be more efficient, but I did not manage to make it distinguish gender more accurately, so I chose the first approach, which is less efficient but considerably more accurate.