# Generating Mexican municipality names

Mexico has over 2400 municipalities whose names have their roots in Nahuatl (the Aztec language), Spanish, Maya, and many other regional languages. As a result, Mexican place names are very diverse.

This notebook uses an RNN to learn from these names and generate new ones that look like the names of Mexican municipalities.

This notebook is based on the TensorFlow tutorial that generates Shakespeare-style text, which can be found [here](https://www.tensorflow.org/tutorials/sequences/text_generation). I changed the corpus to the names of the Mexican municipalities and reduced the character n-gram (the input sequence length) to 1, because municipality names are usually just one word.

Load packages

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
tf.enable_eager_execution()

from google.colab import files
import numpy as np
import os
import time
import urllib.request

Load the dataset from my GitHub repository and display it.

In [3]:
url = 'https://raw.githubusercontent.com/RicardoHE97/RN-Unison/master/modelo_de_generacion_de_texto_con_redes_recurrentes/municipios.txt'
response = urllib.request.urlopen(url)
data = response.read()      # a `bytes` object
text = data.decode('utf-8')
text

'Abalá\nAbasolo\nAbasolo\nAbasolo\nAbasolo\nAbejones\nAcacoyagua\nAcajete\nAcajete\nAcala\nAcámbaro\nAcambay de Ruíz Castañeda\nAcanceh\nAcapetahua\nAcaponeta\nAcapulco de Juárez\nAcateno\nAcatepec\nAcatic\nAcatlán\nAcatlán\nAcatlán\nAcatlán de Juárez\nAcatlán de Pérez Figueroa\nAcatzingo\nAcaxochitlán\nAcayucan\nAcolman\nAconchi\nActeopan\nActopan\nActopan\nAcuamanala de Miguel Hidalgo\nAcuitzio\nAcula\nAculco\nAcultzingo\nAcuña\nAgua Blanca de Iturbide\nAgua Dulce\nAgua Prieta\nAgualeguas\nAguascalientes\nAguililla\nAhome\nAhuacatlán\nAhuacatlán\nAhuacuotzingo\nAhualulco\nAhualulco de Mercado\nAhuatlán\nAhuazotepec\nAhuehuetitla\nAhumada\nAjacuba\nAjalpan\nAjuchitlán del Progreso\nAkil\nÁlamo Temapache\nAlamos\nAlaquines\nAlbino Zertuche\nAlcozauca de Guerrero\nAldama\nAldama\nAldama\nAlfajayucan\nAljojuca\nAllende\nAllende\nAllende\nAlmoloya\nAlmoloya de Alquisiras\nAlmoloya de Juárez\nAlmoloya del Río\nAlpatláhuac\nAlpoyeca\nAltamira\nAltamirano\nAltar\nAltepexi\nAlto Lucero de Gut

Display the length of the whole corpus.

In [4]:
print ('Length of text: {} characters'.format(len(text)))

Length of text: 35530 characters


Display the number of unique characters in the corpus.

In [5]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

68 unique characters


## Process the text

### Vectorize the text

Before training, we need to map strings to a numerical representation. Create two lookup tables: one mapping characters to numbers, and another for numbers to characters.

In [0]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

In [7]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  ' ' :   1,
  ',' :   2,
  '-' :   3,
  '.' :   4,
  '0' :   5,
  '2' :   6,
  '6' :   7,
  '8' :   8,
  'A' :   9,
  'B' :  10,
  'C' :  11,
  'D' :  12,
  'E' :  13,
  'F' :  14,
  'G' :  15,
  'H' :  16,
  'I' :  17,
  'J' :  18,
  'K' :  19,
  ...
}
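As a quick sanity check of the two lookup tables, the sketch below encodes a few names and decodes them again. It rebuilds the tables on a tiny hand-picked sample so that it runs on its own; `sample_text` and the other `sample_*` names are illustrative and not part of the notebook.

```python
import numpy as np

# Tiny stand-in corpus (assumption); the notebook uses the full municipios.txt.
sample_text = 'Abalá\nAbasolo\nAcuña'

# Same construction as in the notebook, applied to the sample.
sample_vocab = sorted(set(sample_text))
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = np.array(sample_vocab)

# Characters -> integers -> characters.
encoded = np.array([sample_char2idx[ch] for ch in sample_text])
decoded = ''.join(sample_idx2char[encoded])

print(encoded[:6])
print(repr(decoded))
```

If the two tables are consistent, `decoded` is identical to the original string, including accented characters and newlines.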


### Create training examples and targets

In [8]:
# The maximum length sentence we want for a single input in characters
seq_length = 1
examples_per_epoch = len(text)//seq_length

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])
  
print(examples_per_epoch)

Instructions for updating:
Colocations handled automatically by placer.
A
b
a
l
á
35530


In [9]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

'Ab'
'al'
'á\n'
'Ab'
'as'


In [0]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [11]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'A'
Target data: 'b'


In [12]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 9 ('A')
  expected output: 35 ('b')
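Because `batch(seq_length+1, drop_remainder=True)` cuts the text into consecutive, non-overlapping chunks, with `seq_length = 1` only every other character transition becomes a training pair: 'A' to 'b' and 'a' to 'l', but not 'b' to 'a'. A minimal pure-Python sketch of that chunking follows; the helper `chunk_pairs` is hypothetical and not part of the notebook.

```python
def chunk_pairs(text, seq_length=1):
    """Mimic batch(seq_length + 1, drop_remainder=True) followed by
    split_input_target, using plain Python strings."""
    size = seq_length + 1
    pairs = []
    # Step through the text in non-overlapping windows of `size` characters;
    # the range stops early enough that an incomplete trailing chunk is dropped.
    for start in range(0, len(text) - size + 1, size):
        chunk = text[start:start + size]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs

pairs = chunk_pairs('Abalá\nAbasolo\n')
print(pairs[:5])
```

The first pairs correspond to the chunks `'Ab'`, `'al'`, `'á\n'`, `'Ab'`, `'as'` printed earlier in the notebook.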


In [13]:
# Batch size
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch//BATCH_SIZE

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<DatasetV1Adapter shapes: ((64, 1), (64, 1)), types: (tf.int64, tf.int64)>
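One thing worth double-checking here is the step count. `examples_per_epoch` is computed as `len(text)//seq_length`, but the sequences are built from chunks of `seq_length+1` characters, so with `seq_length = 1` the dataset holds about half that many examples. The arithmetic below uses only the corpus length printed earlier; the names `actual_sequences` and `actual_batches` are introduced here for illustration.

```python
text_length = 35530   # corpus length printed earlier in the notebook
seq_length = 1
batch_size = 64

# Values as computed in the notebook.
examples_per_epoch = text_length // seq_length
steps_per_epoch = examples_per_epoch // batch_size

# Chunks produced by batch(seq_length + 1, drop_remainder=True).
actual_sequences = text_length // (seq_length + 1)
# Full batches produced by batch(batch_size, drop_remainder=True).
actual_batches = actual_sequences // batch_size

print(steps_per_epoch, actual_batches, steps_per_epoch / actual_batches)
```

This gives 555 steps per epoch against 277 available batches. Because training uses `dataset.repeat()`, this does not cause an error; it only means that one "epoch" here is roughly two passes over the data.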

In [0]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

Use a GPU if one is available.

In [0]:
if tf.test.is_gpu_available():
  rnn = tf.keras.layers.CuDNNGRU
else:
  import functools
  rnn = functools.partial(
    tf.keras.layers.GRU, recurrent_activation='sigmoid')

In [0]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    rnn(rnn_units,
        return_sequences=True,
        recurrent_initializer='glorot_uniform',
        stateful=True),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [0]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

In [18]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 1, 68) # (batch_size, sequence_length, vocab_size)


In [19]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           17408     
_________________________________________________________________
cu_dnngru (CuDNNGRU)         (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 68)            69700     
Total params: 4,025,412
Trainable params: 4,025,412
Non-trainable params: 0
_________________________________________________________________
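The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual layouts: an embedding table of `vocab_size × embedding_dim`, a GRU with three gates that each have an input kernel, a recurrent kernel and (as in the CuDNN variant) two bias vectors, and a dense layer with one bias per output unit.

```python
vocab_size = 68
embedding_dim = 256
rnn_units = 1024

# One embedding vector per character.
embedding_params = vocab_size * embedding_dim

# Three gates, each with an input kernel, a recurrent kernel and two biases.
gru_params = 3 * (embedding_dim * rnn_units
                  + rnn_units * rnn_units
                  + 2 * rnn_units)

# Weight matrix plus one bias per output character.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

These values agree with the 17408, 3938304, 69700 and 4,025,412 reported by `model.summary()`.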


In [0]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

In [21]:
sampled_indices

array([51])

In [0]:
#print(repr("".join(idx2char[tuple(input_example_batch[0])])))
print("Input: \n", repr("".join(idx2char[tuple(input_example_batch[0])])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 '.'

Next Char Predictions: 
 'B'


The standard tf.keras.losses.sparse_categorical_crossentropy loss function works in this case because it is applied across the last dimension of the predictions. 

In [22]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 1, 68)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.2206793
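A useful sanity check on this number: an untrained model should be roughly uniform over the vocabulary, so its cross-entropy should be close to ln(68) ≈ 4.22, which is what the cell above reports. The NumPy sketch below computes the same kind of loss for all-zero logits; the helper `sparse_ce_from_logits` is illustrative and not a TensorFlow function.

```python
import numpy as np

def sparse_ce_from_logits(labels, logits):
    """Cross-entropy between integer labels and unnormalized logits,
    taken over the last axis."""
    # Log-softmax, shifted by the max for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the correct class at every position.
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)

vocab_size = 68
# Random labels and all-zero logits with the shapes used in the notebook.
labels = np.random.randint(0, vocab_size, size=(64, 1))
uniform_logits = np.zeros((64, 1, vocab_size))

baseline = sparse_ce_from_logits(labels, uniform_logits).mean()
print(baseline, np.log(vocab_size))
```

A loss well above this baseline right after initialization would usually point to a problem, for example passing probabilities where logits are expected.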


In [0]:
model.compile(
    optimizer = tf.train.AdamOptimizer(),
    loss = loss)

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

### Train the model

In [25]:
EPOCHS=20
history = model.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=steps_per_epoch, callbacks=[checkpoint_callback])


Epoch 1/20
Instructions for updating:
Use tf.train.CheckpointManager to manage checkpoints rather than manually editing the Checkpoint proto.
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


Load the latest checkpoint

In [26]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_20'

In [0]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

Create a function that uses the model to generate text

In [0]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # sample the next character from the categorical distribution given by the model's logits
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
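To see what the temperature does, the sketch below applies it to a made-up logit vector with NumPy. The values in `logits` and the helper `softmax_with_temperature` are purely illustrative. Dividing the logits by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # shift for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Made-up logits for a three-character vocabulary (assumption).
logits = [2.0, 1.0, 0.1]

for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))

# Sampling one character id, analogous to what generate_text does per step.
rng = np.random.default_rng(0)
next_id = rng.choice(len(logits), p=softmax_with_temperature(logits, 0.5))
```

In `generate_text` the same effect is obtained by dividing `predictions` by `temperature` before calling `tf.random.categorical`, which samples from the softmax of the logits it receives.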

In [29]:
print(generate_text(model, start_string=u"A")) # The choice of starting character doesn't matter

A. Pe
Fe
Tin Mel lo Ixlóbobl Chingro dechuca
Sa Mamantatlahepelán
Chutí
Jorosane
Renotixtachitlolamal
Zil Abarían dexépan
Jon d
Sol Juoco Naro Rotlaltinris Llaha Margo Il -Daxco
Ron Or
Sastl Pe Atall Lues
Arapes Moraruapaorasisitepar Mapan Carín
Canacáracínco
Salvanchto Ravaranuan
Elcalaroyánan Pén Hote Ojuelilosin
Emónco
Amora Oco
Callo
Canagos
Alva
Atléha
Chuiántl Medez Vipel Jitaz
Catlántarindepa Mas
Cilahumpue Dían Gosalápec
Buixtitl
Domucía
Lue Charla -Donc Ac
C Coré depusqun den
Con
Sixteltaz
Undan Nu Jistlan An Co
Alos
Zo Iltl Caroronalten
Hocosí
Tlgasénggoran Cos Caz Xitlahavan Ja
Peontlápoto
Zas Biran dacon desporé Milecotlumbare do doblteroltlo
Tulac
Lo Ro Cocostán
Flántan lanjas
Hióncon Chir dz Fonz Mesto
San Buihujo Malchaz Ayateoran
Sastl Repimos Bechitl
Pe
Marrés
Lopa Do
Huepetlote Ma Ma Quan Sat dalán Tranla Labiaracáro
Gonterietloparma
Ma Mo
San Caqui. Ba Alan Luiber Combrin Comos Juia
Cogezás Satraligolatar
Tiazuhueza
Asin Yanscomastepa
Min de Jo Textlachua
Istlánto Az