# Problem definition

The goal is to train a model that can complete a partially typed word, or propose a correction for a fully typed word that is misspelled.

A misspelling-detection algorithm will decide whether a word is written incorrectly by checking it against a lexicon built from words extracted from the RAE website (available at https://github.com/JorgeDuenasLerin/diccionario-espanol-txt, updated in May 2024).
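
The lexicon check described above could look roughly like the following. This is a minimal sketch, not the notebook's actual detection algorithm: it assumes the lexicon file holds one word per line, and the names `load_lexicon`, `is_misspelled` and `demo_lexicon` are hypothetical.

```python
def load_lexicon(path: str) -> set[str]:
    """Read one word per line into a set (assumed file layout)."""
    with open(path, "r", encoding="UTF-8") as file:
        return {line.strip().lower() for line in file if line.strip()}


def is_misspelled(word: str, lexicon: set[str]) -> bool:
    """Treat a word as misspelled when it is absent from the lexicon."""
    return word.lower() not in lexicon


# Tiny in-memory lexicon used only for illustration (assumption, not real data)
demo_lexicon = {"casa", "perro", "gato"}
print(is_misspelled("Casa", demo_lexicon))  # False
print(is_misspelled("csaa", demo_lexicon))  # True
```

A set gives constant-time membership tests on average, which matters once the lexicon holds the full dictionary.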

In addition, a probability matrix will be built from the extracted words so that completion and correction suggestions are ranked by how frequently each word is used. The system will perform these functions for Spanish.
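
To illustrate frequency-based ranking, the sketch below counts word occurrences and returns the most frequent words that start with a given prefix. It uses plain counts as a simpler stand-in for the probability matrix mentioned above; `count_words`, `complete` and the sample sentence are assumptions made for the example.

```python
from collections import Counter
import re


def count_words(text: str) -> Counter:
    """Count how often each word occurs in the text (case-insensitive)."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented characters are kept
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def complete(prefix: str, freqs: Counter, limit: int = 3) -> list[str]:
    """Return up to `limit` words starting with `prefix`, most frequent first."""
    candidates = [word for word in freqs if word.startswith(prefix.lower())]
    # Sort by descending frequency, then alphabetically for a stable order
    candidates.sort(key=lambda word: (-freqs[word], word))
    return candidates[:limit]


demo_freqs = count_words("la casa y el camino; la casa grande, el campo y la cama")
print(complete("ca", demo_freqs))  # ['casa', 'cama', 'camino']
```

Dividing each count by the total number of words would turn the counts into relative frequencies if probabilities are needed.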

The model will be trained on text; the cells below read it from a Spanish book converted to plain text.

# Model structure

# Library imports

In [None]:
import numpy as np

import keras as kr
from keras.models import Sequential
from keras.layers import Dense, LSTM

# Input data

In [None]:
# Extract text from Spanish book converted into txt format
txt_file = "text_dump.txt"

lines: list[str] = []
with open(txt_file, 'r', encoding="UTF-8") as file:
  for line in file:
    if line != "\n": #Do not include empty lines in text digest
      lines.append(line)

#At this point, 'lines' holds every non-empty line of the txt file
print("Total line count:", len(lines))
print("Total character count:", sum([len(item) for item in lines]))

In [None]:
# Extract used vocabulary and construct encode/decode dictionaries
full_txt_str: str = ""
vocab: list[str] = []
acc = 0 #Used to print partial progress of the operation
big_acc = 1 #Same as above
for paragraph in lines[31:14346]: #Do not consider index, acknowledgements, appendix, etc etc (i.e. only consider main story block for training)
  full_txt_str += paragraph
  for char in paragraph:
    if char not in vocab:
      vocab.append(char)
  #Print partial progress
  acc += 1
  if (acc >= (14346-31)*0.2):
    print(f"Vocabulary {20*big_acc}% built")
    acc = 0
    big_acc += 1
vocab.sort()
vocab_size = len(vocab)

encode_keys:dict[str,int] = {}
decode_keys: dict[int, str] = {}
for index, char in enumerate(vocab):
  encode_keys[char] = index
  decode_keys[index] = char

#As of this point, the full vocabulary should be indexed
print("Total unique characters:", vocab_size)
print("Total encode/decode dictionary keys:", len(encode_keys.keys()),"|", len(decode_keys.keys()))
print("Key list\n",encode_keys.keys())

In [None]:
# Construct input/output data for LSTM network
input_txt = full_txt_str[:-1]
output_txt = full_txt_str[1:]
train_data_list: list[np.ndarray] = []
train_tag_list: list[np.ndarray] = []
acc = 0 #Used to print partial progress of the operation
big_acc = 1 #Same as above

#The LSTM network will receive a 'card' (one-hot vector) with one slot per vocab character. The slot of the desired character is set to 1.0 while all others remain zero
for item in zip(input_txt, output_txt):
  data:np.ndarray = np.zeros(vocab_size, dtype=np.float64)
  tag:np.ndarray = np.zeros(vocab_size, dtype=np.float64)
  data[encode_keys[item[0]]] = 1.0
  tag[encode_keys[item[1]]] = 1.0

  train_data_list.append(data)
  train_tag_list.append(tag)
  #Print partial progress
  acc += 1
  if (acc >= len(input_txt)*0.2):
    print(f"IO data {20*big_acc}% built")
    acc = 0
    big_acc += 1

#As of this point, the full set of training data has been built
train_data = np.array(train_data_list)
train_tag = np.array(train_tag_list)
print("Train shape:", train_data.shape)
print("Tag shape:", train_tag.shape)
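
The same one-hot 'cards' can also be produced without an explicit Python loop by indexing rows of an identity matrix. The sketch below shows the idea on a four-letter string; `demo_text`, `demo_vocab` and `demo_encode` are hypothetical names used only here, and this is an alternative formulation rather than the notebook's own code.

```python
import numpy as np

demo_text = "hola"
demo_vocab = sorted(set(demo_text))  # ['a', 'h', 'l', 'o']
demo_encode = {char: index for index, char in enumerate(demo_vocab)}

# Integer index of every character, then one identity-matrix row per index
indices = np.array([demo_encode[char] for char in demo_text])
one_hot = np.eye(len(demo_vocab), dtype=np.float64)[indices]

# Each character is paired with the character that follows it
inputs, targets = one_hot[:-1], one_hot[1:]
print(inputs.shape, targets.shape)  # (3, 4) (3, 4)
```

For a full book this builds the whole array in one step, although the result still needs one row of `vocab_size` floats per character.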

# Model construction

In [None]:
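vect_model = Sequential()
vect_model.add(kr.Input(shape=(vocab_size,)))
#LSTM layers expect (timesteps, features) per sample, so each one-hot vector becomes a sequence of length 1
vect_model.add(kr.layers.Reshape((1, vocab_size)))
vect_model.add(LSTM(100)) #Returns only the last output so predictions match the (vocab_size,) tags. Add dropout?
vect_model.add(Dense(vocab_size, activation="softmax"))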

vect_model.compile(loss="categorical_crossentropy",optimizer="rmsprop", metrics=["accuracy"])

# Model training

In [None]:
vect_model.fit(train_data, train_tag, epochs = 10)
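
Once a model of this kind is trained, a word could be completed by repeatedly picking the most probable next character. The sketch below shows that greedy loop under stated assumptions: `complete_word` and `predict_next` are hypothetical, `predict_next` stands for any function that maps the last character to a probability vector over the vocabulary (for example a wrapper around the model's prediction call), and a stub replaces the model so the example runs on its own. The prefix is assumed to be non-empty.

```python
from typing import Callable, Sequence


def complete_word(prefix: str,
                  predict_next: Callable[[str], Sequence[float]],
                  decode: dict[int, str],
                  max_chars: int = 20) -> str:
    """Greedily append the most probable next character until a non-letter appears."""
    word = prefix
    for _ in range(max_chars):
        probs = predict_next(word[-1])  # distribution over the vocabulary
        best = max(range(len(probs)), key=lambda i: probs[i])
        next_char = decode[best]
        if not next_char.isalpha():  # a space or punctuation mark ends the word
            break
        word += next_char
    return word


# Stub standing in for the trained model (assumption for illustration only)
demo_decode = {0: " ", 1: "a", 2: "h", 3: "l", 4: "o"}
demo_next = {"h": 4, "o": 3, "l": 1, "a": 0}


def stub_predict(last_char: str) -> list[float]:
    probs = [0.0] * len(demo_decode)
    probs[demo_next[last_char]] = 1.0
    return probs


print(complete_word("h", stub_predict, demo_decode))  # hola
```

Because only the last character is passed in, the prediction carries no memory of earlier letters; `max_chars` caps the loop in case the predictions cycle.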

# Evaluation