<h1 style="color:gold; background-color:black; padding:20px">QUECHUA - SPANISH / SPANISH - QUECHUA TRANSLATOR</h1>

<h1>Import the required libraries</h1>

In [1]:
import json
import numpy as np
import os
import pandas as pd

In [2]:
import pydot
import graphviz

In [3]:
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model

<h1>Define helper functions</h1>

<h3 style="color:crimson">Print a progress bar</h3>

In [4]:
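def imprimir_barra_progreso_v2(progreso, total):
    # https://www.youtube.com/watch?v=x1eaT88vJUA

    # Percentage completed; the +1 assumes `progreso` is a zero-based index
    p = int(100 * (progreso + 1) / total)

    # Alt+291: █
    # 100-character bar: one block per completed percent
    barra = '█' * p + '-' * (100 - p)

    # The carriage returns redraw the bar on the same line
    print("\r|{}| {}%".format(barra, p), end='\r')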
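<p>A minimal usage sketch of the same idea. The helper <code>construir_barra</code> and its <code>ancho</code> parameter are hypothetical names introduced here, not part of the notebook: the helper returns the bar as a string so it can be inspected, and the loop stands in for real work.</p>

```python
import time


def construir_barra(progreso, total, ancho=100):
    # Hypothetical helper (not in the notebook): builds the bar text
    # instead of printing it, so the result can be checked
    p = int(100 * (progreso + 1) / total)
    llenos = ancho * p // 100
    return "|{}| {}%".format('█' * llenos + '-' * (ancho - llenos), p)


total = 5
for paso in range(total):
    time.sleep(0.01)  # stand-in for real work
    # Redraw the bar on the same line, as the notebook function does
    print("\r" + construir_barra(paso, total, ancho=20), end="\r")
print()
```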

<h1>Import the data</h1>

<p>The files contain parallel words and sentences.
<br>
The first column is in Spanish and the second in Quechua.</p>

In [5]:
# Create an empty DataFrame that will hold the translations
df_trans = pd.DataFrame(columns=["español","quechua"])
df_trans

Unnamed: 0,español,quechua


<p>A tab ("\t") is used as the column separator to avoid confusion with the commas that occur inside the sentences.</p>
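<p>The effect of the tab separator can be sketched as follows. To stay dependency-free, this sketch uses the standard-library <code>csv</code> module rather than pandas, and it reads from an in-memory string instead of a file. The rows are copied from the data shown in this notebook.</p>

```python
import csv
import io

# Two rows in the notebook's layout: Spanish, a tab, then Quechua
contenido = (
    "español\tquechua\n"
    "algún\twakin\n"
    "Sé rica adentro, en vez de serlo afuera.\t"
    "Hawamanta qhapaq kaymantaqa, ukhupi qhapaq kay.\n"
)

# With a tab delimiter, the commas inside each sentence are ordinary text
filas = list(csv.reader(io.StringIO(contenido), delimiter="\t"))
encabezado, datos = filas[0], filas[1:]

for es, qu in datos:
    print(repr(es), "->", repr(qu))
```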

<h3 style="color:crimson">Load the DataFrames</h3>

In [6]:
sub_paths = ["./Datos/palabras/",
             "./Datos/grupos/",
             "./Datos/libros/"]

for sub_path in sub_paths:
    arr_category = os.listdir(sub_path)

    for item in arr_category:
        # Only load the tab-separated .csv files
        if item.endswith(".csv"):
            file = os.path.join(sub_path, item)
            df_temp = pd.read_csv(file, encoding="utf-8", sep="\t")

            # Append the file's rows to the accumulated translations
            df_trans = pd.concat([df_trans, df_temp], ignore_index=True)

In [7]:
df_trans

Unnamed: 0,español,quechua
0,ácido,ácido nisqa
1,agradable,munasqa
2,agrícola,chakra llamk’aymanta
3,algún,wakin
4,amable,kuyakuq
...,...,...
17576,que aumenten sus fatigas tu tesoro;,qhapaq kayniyki llank’ayninkuta yapachun;
17577,y cambia horas de espuma por divinas.,hinaspa horas de espuma cambiay divinopaq.
17578,"Sé rica adentro, en vez de serlo afuera.","Hawamanta qhapaq kaymantaqa, ukhupi qhapaq kay."
17579,"Devora tú a la Muerte y no la nutras,","Wañuytaqa mikhunkichis, manataq mikhuchinkich..."


<h3 style="color:crimson">Remove very long sentences</h3>
<p>Before this step was added, allocating memory for the NumPy arrays failed with the following error:</p>
<p>Unable to allocate 152. GiB for an array with shape (18435, 18240, 121) and data type float32</p>
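<p>The figure in the error message can be checked with simple arithmetic: a float32 value takes 4 bytes, so the array size is the product of its dimensions times 4. The sketch below also estimates the encoder input tensor after filtering, using the statistics this notebook prints further down (16878 samples, maximum input length 250, 117 input tokens). The function name <code>gib_float32</code> is introduced here only for illustration.</p>

```python
def gib_float32(shape):
    # Number of elements times 4 bytes per float32, expressed in GiB
    n = 1
    for dim in shape:
        n *= dim
    return n * 4 / 2**30


antes = gib_float32((18435, 18240, 121))   # shape from the error message
despues = gib_float32((16878, 250, 117))   # encoder input after filtering

print(f"before filtering: {antes:.1f} GiB")
print(f"after filtering:  {despues:.2f} GiB")
```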

In [8]:
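for index, registro in df_trans.iterrows():
    # Access by column name; positional `registro[0]` is deprecated in recent pandas
    txt_es = registro["español"]
    txt_qu = registro["quechua"]

    # Drop any pair in which either side exceeds 250 characters
    if (len(txt_es) > 250) or (len(txt_qu) > 250):
        df_trans = df_trans.drop(index)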

In [9]:
df_trans

Unnamed: 0,español,quechua
0,ácido,ácido nisqa
1,agradable,munasqa
2,agrícola,chakra llamk’aymanta
3,algún,wakin
4,amable,kuyakuq
...,...,...
17576,que aumenten sus fatigas tu tesoro;,qhapaq kayniyki llank’ayninkuta yapachun;
17577,y cambia horas de espuma por divinas.,hinaspa horas de espuma cambiay divinopaq.
17578,"Sé rica adentro, en vez de serlo afuera.","Hawamanta qhapaq kaymantaqa, ukhupi qhapaq kay."
17579,"Devora tú a la Muerte y no la nutras,","Wañuytaqa mikhunkichis, manataq mikhuchinkich..."


<h1>Configuration</h1>

In [10]:
batch_size = 64  # batch size for training
epochs = 250  # number of training epochs
latent_dim = 256  # dimensionality of the encoder's latent space
num_samples = 10000

<h1>Prepare the data</h1>

In [11]:
# Vectorize the data
i=0
targe_text= ''

input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

In [12]:
for index, registro in df_trans.iterrows():

    input_text = registro["español"]
    target_text = registro["quechua"]

    # Preview the first rows. Note: `targe_text` is the empty string
    # defined earlier (not `target_text`), so the T: field prints blank.
    if (index<10):
        print("{} \t I: {} \t T: {}"
              .format(index, input_text, targe_text))

    # We use "\t" as the start-of-sequence character for the targets
    # and "\n" as the end-of-sequence character
    target_text = "\t" + target_text + "\n"
    # Append the pair to the lists
    input_texts.append(input_text)
    target_texts.append(target_text)

    # Add any unseen characters to the character sets
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)
    i=i+1

0 	 I: ácido 	 T: 
1 	 I: agradable 	 T: 
2 	 I: agrícola 	 T: 
3 	 I: algún 	 T: 
4 	 I: amable 	 T: 
5 	 I: amargo 	 T: 
6 	 I: ambos 	 T: 
7 	 I: ancho 	 T: 
8 	 I: aquel 	 T: 
9 	 I: aquellas 	 T: 


In [13]:

# Convert the two character sets
# into two sorted lists
input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
# Count the tokens (characters) on each side
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
# Compute the maximum sequence length on each side
max_encoder_seq_length = max([len(text) for text in input_texts])
max_decoder_seq_length = max([len(text) for text in target_texts])
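<p>The vocabulary steps above can be sketched on two pairs taken from the data shown earlier. This is a small stand-alone illustration, not the notebook's full pipeline; the variable <code>pares</code> is introduced here. Note that the start ("\t") and end ("\n") markers count both as tokens and toward the target sequence length.</p>

```python
# Two Spanish/Quechua pairs from the notebook's data
pares = [("agradable", "munasqa"), ("amable", "kuyakuq")]

input_texts = [es for es, _ in pares]
# Wrap each target in the start ("\t") and end ("\n") markers
target_texts = ["\t" + qu + "\n" for _, qu in pares]

# Sorted character vocabularies for each side
input_characters = sorted(set("".join(input_texts)))
target_characters = sorted(set("".join(target_texts)))

max_encoder_seq_length = max(len(text) for text in input_texts)
max_decoder_seq_length = max(len(text) for text in target_texts)

print("input tokens:", len(input_characters))
print("target tokens:", len(target_characters))
print("max lengths:", max_encoder_seq_length, max_decoder_seq_length)
```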

In [14]:
print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)
print("preparing data...")

Number of samples: 16878
Number of unique input tokens: 117
Number of unique output tokens: 109
Max sequence length for inputs: 250
Max sequence length for outputs: 252
preparing data...


In [15]:
# Build the token dictionaries
input_token_index = dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict([(char, i) for i, char in enumerate(target_characters)])

# Create the one-hot tensors for the encoder and the decoder
encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype="float32")

decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32")

decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32")

In [16]:
for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]] = 1.0
    # Pad the rest of the encoder sequence with the space character
    encoder_input_data[i, t + 1 :, input_token_index[" "]] = 1.0
    for t, char in enumerate(target_text):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # decoder_target_data will be ahead by one timestep
            # and will not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    # Pad the rest of both decoder sequences with the space character
    decoder_input_data[i, t + 1 :, target_token_index[" "]] = 1.0
    decoder_target_data[i, t:, target_token_index[" "]] = 1.0

print ("\n....")
print("data prepared")


....
data prepared
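<p>The relationship between the two decoder tensors can be checked on a tiny example. The sketch below repeats the decoder half of the loop above on two targets from the data and confirms that <code>decoder_target_data</code> equals <code>decoder_input_data</code> shifted one timestep to the left. The names <code>pad</code> and <code>shifted</code> are introduced here for illustration.</p>

```python
import numpy as np

# Two targets already wrapped in the start ("\t") and end ("\n") markers
target_texts = ["\twakin\n", "\tácido nisqa\n"]
target_characters = sorted(set("".join(target_texts)))
target_token_index = {char: i for i, char in enumerate(target_characters)}

shape = (len(target_texts),
         max(len(text) for text in target_texts),
         len(target_characters))
decoder_input_data = np.zeros(shape, dtype="float32")
decoder_target_data = np.zeros(shape, dtype="float32")

pad = target_token_index[" "]  # the space character doubles as padding
for i, target_text in enumerate(target_texts):
    for t, char in enumerate(target_text):
        decoder_input_data[i, t, target_token_index[char]] = 1.0
        if t > 0:
            # The target at step t - 1 is the input character at step t
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.0
    decoder_input_data[i, t + 1:, pad] = 1.0
    decoder_target_data[i, t:, pad] = 1.0

# Dropping the first input step must reproduce the targets
shifted = np.array_equal(decoder_target_data[:, :-1], decoder_input_data[:, 1:])
print("target is input shifted by one step:", shifted)
```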


<h1>Build the model</h1>

<h3 style="color:crimson">Encoder</h3>

In [17]:
# Define the input sequence and process it
encoder_inputs = Input(shape = (None, num_encoder_tokens))

# Recurrent layer of the encoder
encoder = LSTM(latent_dim, return_state = True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard the outputs (encoder_outputs) and keep only
# the short-term (state_h) and long-term (state_c) memory
encoder_states = [state_h, state_c]

<h3 style="color:crimson">Decoder</h3>

In [18]:
# Set up the decoder, using 'encoder_states' as its initial state
decoder_inputs = Input(shape= (None, num_decoder_tokens))

# Recurrent layer of the decoder
# The decoder is configured to return full output sequences
# and also its internal states. The returned states are not used
# in the training model, but they are used at inference time.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _,_ = decoder_lstm(decoder_inputs,initial_state = encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
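<p>To illustrate why the decoder states matter at inference time, here is a framework-free sketch of a greedy character-by-character decoding loop. It is not the notebook's inference code: <code>greedy_decode</code> and <code>toy_step</code> are hypothetical names, and <code>toy_step</code> is a stand-in that spells out a fixed string instead of calling the trained decoder. In a real setup, the step function would run the decoder LSTM on the previous character and carry its states forward.</p>

```python
def greedy_decode(step, initial_state, max_len, start="\t", stop="\n"):
    """Generate characters one at a time until `stop` or `max_len`."""
    state, prev, out = initial_state, start, []
    for _ in range(max_len):
        # `step` returns a probability per character plus the updated state
        probs, state = step(prev, state)
        prev = max(probs, key=probs.get)  # greedy: most probable character
        if prev == stop:
            break
        out.append(prev)
    return "".join(out)


def toy_step(prev_char, state):
    # Hypothetical stand-in for the decoder: it ignores `prev_char` and
    # spells out a fixed string, using `state` as a position counter
    script = "wakin\n"
    return {script[state]: 1.0}, state + 1


print(greedy_decode(toy_step, initial_state=0, max_len=252))
```

<p>The loop stops at the end-of-sequence character or after <code>max_len</code> steps, which is why the maximum decoder sequence length is saved along with the model.</p>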

<h3 style="color:crimson">Full model</h3>

In [19]:
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [20]:
model.summary()
plot_model(model, to_file='../Imagenes/s2s.png', 
           show_shapes=True)

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None, 117)]  0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, None, 109)]  0           []                               
                                                                                                  
 lstm (LSTM)                    [(None, 256),        382976      ['input_1[0][0]']                
                                 (None, 256),                                                     
                                 (None, 256)]                                                     
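<p>The parameter count in the summary can be reproduced by hand. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias, which gives 4 * ((input_dim + units) * units + units) parameters. The sketch below checks the encoder value shown above (382976); the decoder LSTM and Dense figures are computed from the same formulas and are not taken from the truncated summary. The function names are introduced here for illustration.</p>

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # One weight per input/output pair plus one bias per output
    return input_dim * units + units


print("encoder LSTM:", lstm_params(117, 256))
print("decoder LSTM:", lstm_params(109, 256))
print("decoder Dense:", dense_params(256, 109))
```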
                                                                                              

<h1>Train the model</h1>

In [21]:
print(tf.config.list_physical_devices('GPU'))

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [22]:
model.compile(
    optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"]
)
history = model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
)

Epoch 1/250
...
Epoch 250/250
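<p>The effect of <code>validation_split=0.2</code> can be estimated with a little arithmetic. Keras takes the validation portion from the end of the arrays, before any shuffling. Because the files are loaded folder by folder, the validation set here most likely comes from the last folder loaded (./Datos/libros/) and not from a random sample. The sketch below assumes the split index is the floor of n * (1 - validation_split); the variable names are introduced here for illustration.</p>

```python
import math

n_total = 16878          # number of samples printed earlier in the notebook
batch_size = 64
validation_split = 0.2

# Assumption: the split index is floor(n * (1 - validation_split))
split_at = int(math.floor(n_total * (1.0 - validation_split)))
n_train = split_at
n_val = n_total - split_at
steps_per_epoch = math.ceil(n_train / batch_size)

print("training samples:", n_train)
print("validation samples:", n_val)
print("batches per epoch:", steps_per_epoch)
```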


<h1>Save the model</h1>

<h3 style="color:crimson">Dictionaries</h3>

In [23]:
with open("./Modelos/s2q/input_token_index.txt", "w") as f:
    json.dump(input_token_index, f)

In [24]:
with open("./Modelos/s2q/target_token_index.txt", "w") as f:
    json.dump(target_token_index, f)

In [25]:
with open("./Modelos/s2q/history.txt", "w") as f:
    json.dump(history.history, f)
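<p>A sketch of how a saved dictionary could be read back later. It writes a small made-up token index to a temporary directory so the example does not touch ./Modelos/, then rebuilds the reverse mapping (index to character) that decoding would need. The name <code>reverse_input_char_index</code> and the sample dictionary are assumptions for illustration. By default <code>json.dump</code> escapes non-ASCII characters, so accented letters survive the round trip regardless of the file encoding.</p>

```python
import json
import os
import tempfile

# Made-up miniature token index, standing in for the real one
input_token_index = {" ": 0, "a": 1, "ñ": 2, "á": 3}

with tempfile.TemporaryDirectory() as tmp:
    ruta = os.path.join(tmp, "input_token_index.txt")
    with open(ruta, "w") as f:
        json.dump(input_token_index, f)
    with open(ruta) as f:
        cargado = json.load(f)

# Reverse mapping: from token index back to character
reverse_input_char_index = {i: char for char, i in cargado.items()}
print(cargado == input_token_index, reverse_input_char_index)
```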

<h3 style="color:crimson">NumPy arrays</h3>

In [26]:
with open("./Modelos/s2q/encoder_input_data.npy", "wb") as f:
    np.save(f, encoder_input_data)

<h3 style="color:crimson">Scalar values</h3>

In [27]:
print("num_decoder_tokens: {}".format(num_decoder_tokens))
print("max_decoder_seq_length: {}".format(max_decoder_seq_length))

num_decoder_tokens: 109
max_decoder_seq_length: 252


In [31]:
with open("./Modelos/s2q/otros.txt", "w") as f:
    f.write("num_decoder_tokens: {}".format(num_decoder_tokens))
    f.write("\nmax_decoder_seq_length: {}".format(max_decoder_seq_length))
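<p>The text written to otros.txt uses one "name: value" pair per line, so it can be parsed back into integers with a few lines. The sketch below works on an in-memory string with the values printed above; the names <code>contenido</code> and <code>valores</code> are introduced here for illustration.</p>

```python
# Same text the cell above writes to otros.txt
contenido = "num_decoder_tokens: 109\nmax_decoder_seq_length: 252"

valores = {}
for linea in contenido.splitlines():
    # Split each line at the first ": " into a name and an integer value
    clave, _, valor = linea.partition(": ")
    valores[clave] = int(valor)

print(valores)
```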

<h3 style="color:crimson">Model</h3>

In [28]:
# s2q = Spanish to Quechua
model.save("./Modelos/s2q/spanish_to_quechua")



INFO:tensorflow:Assets written to: ./Modelos/s2q/spanish_to_quechua\assets




In [29]:
model.save("./Modelos/s2q/spanish_to_quechua_file.h5")