**PRACTICAL ASSIGNMENT No. 2: MACHINE LEARNING 2**

**MEMBERS: SOL KIDONAKIS AND AGUSTIN ARENAS**

**LIBRARIES**

In [50]:
import os  # needed later for os.makedirs

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, GRU, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [2]:
# Configure the GPU if one is available
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)

**LOADING THE DATASET**

In [3]:
# Download the dataset
path_to_file = tf.keras.utils.get_file(
    "shakespeare.txt",
    "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt"
)

# Read the contents
with open(path_to_file, 'rb') as f:
    text = f.read().decode(encoding='utf-8')
print(f"Text length: {len(text)} characters")


Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1115394/1115394 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
Text length: 1115394 characters


In [5]:
# Show the first 500 characters
print(text[:500])


First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


In [6]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters in the text')


65 unique characters in the text


**PREPROCESSING FOR THE CHARACTER-LEVEL MODEL**

For the character-level model:
- Build a vocabulary of unique characters.
- Map characters to indices and back.

In [60]:


# Map characters to indices and back
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = {idx: char for idx, char in enumerate(vocab)}

# Convert the text to indices
text_as_int = [char2idx[char] for char in text]
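
The two lookup tables are inverses of each other, so encoding followed by decoding should return the original string. A minimal check on a short string follows; `sample` and the `_sample` names are illustrative only and are not part of the assignment's code.

```python
# Build the same kind of lookup tables on a tiny illustrative string.
sample = "First Citizen"
vocab_sample = sorted(set(sample))

char2idx_sample = {char: idx for idx, char in enumerate(vocab_sample)}
idx2char_sample = {idx: char for idx, char in enumerate(vocab_sample)}

# Encode to indices, then decode back to characters.
encoded = [char2idx_sample[char] for char in sample]
decoded = ''.join(idx2char_sample[idx] for idx in encoded)

print(encoded)
print(decoded)
```

Because the vocabulary is sorted, the indices are deterministic for a given text, which matters when weights are reused between the training and generation models.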

Generating training sequences

In [73]:
# Sequence length for the model
SEQ_LENGTH = 100
BATCH_SIZE = 64
BUFFER_SIZE = 10000

# Split the text into sequences of SEQ_LENGTH + 1 characters
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(SEQ_LENGTH + 1, drop_remainder=True)

# Build input-target pairs: the target is the input shifted by one character
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
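
As a sanity check on the pipeline above, the chunking and the one-character shift can be reproduced with plain Python lists. This is a sketch that does not use `tf.data`; the helper `chunk` is a hypothetical stand-in for `batch(..., drop_remainder=True)`, and the toy string is illustrative. The arithmetic at the end uses the text length printed earlier (1115394 characters).

```python
def chunk(items, size):
    """Split items into consecutive, non-overlapping chunks, dropping a short tail."""
    n_full = len(items) // size
    return [items[i * size:(i + 1) * size] for i in range(n_full)]

def split_pair(piece):
    """Input is everything but the last item; target is everything but the first."""
    return piece[:-1], piece[1:]

toy = list("Hello world")   # 11 characters
toy_seq_length = 4          # chunks therefore hold 5 characters
pairs = [split_pair(c) for c in chunk(toy, toy_seq_length + 1)]
for inp, tgt in pairs:
    print(repr(''.join(inp)), '->', repr(''.join(tgt)))

# Same arithmetic for the real corpus: SEQ_LENGTH = 100, BATCH_SIZE = 64.
n_chars = 1115394
n_sequences = n_chars // (100 + 1)
n_batches = n_sequences // 64
print(n_sequences, n_batches)
```

The resulting batch count, 172, agrees with the `172/172` step counter shown in the training log.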

**CHARACTER-LEVEL MODEL DEFINITION**

In [79]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    # A stateful GRU needs a fixed batch size, so the input shape is declared explicitly
    inputs = tf.keras.Input(batch_shape=(batch_size, None))
    x = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)(inputs)
    x = tf.keras.layers.GRU(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform')(x)
    outputs = tf.keras.layers.Dense(vocab_size)(x)  # One logit per character
    return tf.keras.Model(inputs, outputs)

# Model parameters
vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 1024

# Build the model
model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)
model.summary()


**TRAINING THE MODEL**

In [80]:
# Loss function (the model outputs logits, hence from_logits=True)
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Compile the model
model.compile(optimizer='adam', loss=loss)

# Train the model
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS)


Epoch 1/10
[1m172/172[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m172s[0m 990ms/step - loss: 3.1132
Epoch 2/10
[1m172/172[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m175s[0m 1s/step - loss: 1.9212
Epoch 3/10
[1m172/172[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m175s[0m 1s/step - loss: 1.6491
Epoch 4/10
[1m172/172[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m175s[0m 1s/step - loss: 1.5064
Epoch 5/10
[1m172/172[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m174s[0m 1s/step - loss: 1.4226
Epoch 6/10
[1m172/172[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m175s[0m 1s/step - loss: 1.3692
Epoch 7/10
[1m172/172[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m183s[0m 1s/step - loss: 1.3260
Epoch 8/10
[1m172/172[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m175s[0m 1s/step - loss: 1.2898
Epoch 9/10
[1m172/172[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m176s[0m 1s/step - loss: 1.2525
Epoch 10/10
[1m172/172[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m178s[0m
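
The loss reported in the log is a sparse categorical cross-entropy computed from raw logits. For a single position it reduces to the negative log-softmax of the target class. The plain-Python sketch below illustrates that quantity; the function name `sparse_ce_from_logits` is hypothetical and only mirrors the per-position computation, not the Keras implementation.

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one position: log-sum-exp of the logits minus the target logit."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# A model that assigns equal logits to all 65 characters.
uniform_loss = sparse_ce_from_logits([0.0] * 65, label=3)
print(round(uniform_loss, 4))

# A model that is confident and correct scores much lower.
confident = [0.0] * 65
confident[3] = 10.0
print(round(sparse_ce_from_logits(confident, label=3), 4))
```

A uniform guess over the 65 characters costs ln(65), about 4.17, which gives a baseline for reading the per-epoch values above.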

**TEXT GENERATION**

In [93]:
# Build a model with batch_size=1 for inference
gen_model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
gen_model.set_weights(model.get_weights())  # Load the trained weights


In [99]:
def generate_text_char(model, seed_text, num_generate, temperature=1.0):
    """
    Generate text character by character with a stateful model built with batch_size=1.

    Args:
        model: Trained character-level model (batch_size=1).
        seed_text: Initial text used to start generation.
        num_generate: Number of characters to generate.
        temperature: Divides the logits before sampling (lower = more predictable).

    Returns:
        The seed text followed by the generated characters.
    """
    # Clear the recurrent state so that earlier calls do not leak into this one
    for layer in model.layers:
        if hasattr(layer, 'reset_states'):
            layer.reset_states()

    # Convert the seed text to indices
    input_eval = [char2idx[char] for char in seed_text]
    input_eval = tf.expand_dims(input_eval, 0)  # Add the batch dimension

    # Generated characters
    text_generated = []

    for _ in range(num_generate):
        # Get the logits for every position of the input
        predictions = model(input_eval)

        # Drop the batch dimension and scale by temperature
        predictions = tf.squeeze(predictions, 0)  # Shape: (sequence_length, vocab_size)
        predictions = predictions / temperature

        # Sample the next character from the distribution at the last position
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()

        # Store the generated character
        text_generated.append(idx2char[predicted_id])

        # Feed only the new character next; the GRU state carries the context
        input_eval = tf.expand_dims([predicted_id], 0)

    # Join the seed text and the generated text
    return seed_text + ''.join(text_generated)
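
To make the role of `temperature` concrete, the sketch below applies a temperature-scaled softmax to a small set of made-up logits. It uses only the standard library; `softmax_with_temperature` and `toy_logits` are illustrative names, not part of the notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then normalize with a stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the largest logit, so sampling becomes more predictable; higher temperatures flatten the distribution and let unlikely characters through more often.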


In [96]:
# Parameter configuration
temperaturas = [0.5, 1.0, 1.5]  # Temperatures to try
longitudes = [50, 100, 200]  # Number of characters to generate
seed_text = "To be"  # Seed text


In [97]:
# Generate text for each combination
for temp in temperaturas:
    print(f"\n--- Temperature: {temp} ---\n")
    for length in longitudes:
        print(f"--- Length: {length} characters ---")
        generated_text = generate_text_char(
            model=gen_model,  # Model with batch_size=1
            seed_text=seed_text,
            num_generate=length,
            temperature=temp
        )
        print(generated_text)
        print("\n")


--- Temperature: 0.5 ---

--- Length: 50 characters ---
To be the first of all the house of York,
That would no


--- Length: 100 characters ---
To be so strife: he has been done,
That had show it confess,
To see the time to the last,
And not not sti


--- Length: 200 characters ---
To begin to send thy lands
That we have sometime consul should be
the storm and seven such as the son:
The garden of the seasons of my son,
I'll vent that villain and the beggar of our grave?

GLOUCESTER:




--- Temperature: 1.0 ---

--- Length: 50 characters ---
To beat and look chosen me
To figure our way, as they a


--- Length: 100 characters ---
To beat
Master resident, and chosen till Richmond,
imsolsture your leave should be taught,
And there a me


--- Length: 200 characters ---
To bear well affection, and, and lacks,
I should not have no lenitemanted sled.

GLOUCESTER:
Why?

RUTHOMAS:
Your prison when they sleep?

MARCIUS:
Thou nobless, marroy, PETESTRLIA:
Feward in the household


**PREPROCESSING FOR THE WORD-LEVEL MODEL**

In [18]:
# Word-level tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])  # Build the vocabulary from the text

# Convert the text into a sequence of word indices
word_indices = tokenizer.texts_to_sequences([text])[0]
vocab_size = len(tokenizer.word_index) + 1  # +1 because word indices start at 1 and index 0 is reserved
print(f"Vocab size: {vocab_size}")

Vocab size: 12633
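
For intuition about what the tokenizer produces, the sketch below approximates the default behavior in plain Python. It assumes the usual `Tokenizer` defaults: lowercasing, punctuation replaced by spaces (the apostrophe is kept), and indices assigned by descending frequency starting at 1. The function `simple_word_index` and the constant `FILTERS` are hypothetical and are not the library's implementation.

```python
# Assumed default filter set: punctuation, tab and newline, but not the apostrophe.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_index(texts):
    """Approximate a frequency-ranked word index (1 = most frequent word)."""
    table = str.maketrans({c: ' ' for c in FILTERS})
    counts = {}
    for t in texts:
        for word in t.lower().translate(table).split(' '):
            if word:  # skip empty strings left by consecutive separators
                counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so ties keep their order of first appearance
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank + 1 for rank, (word, _) in enumerate(ranked)}

toy_index = simple_word_index(["Speak, speak. Speak! We know't, we know't; away"])
toy_vocab_size = len(toy_index) + 1  # index 0 stays unused
print(toy_index, toy_vocab_size)
```

Under these assumptions contractions such as "know't" survive as single tokens, which is consistent with words like "he's" appearing in the generated word-level text.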


In [19]:
# Build training sequences with a sliding window of seq_length + 1 words
seq_length = 10
sequences = []

for i in range(seq_length, len(word_indices)):
    seq = word_indices[i - seq_length:i + 1]  # input sequence + target word
    sequences.append(seq)

# Convert to a TensorFlow Dataset
sequences = tf.constant(sequences)
dataset = tf.data.Dataset.from_tensor_slices(sequences)

# Split into input and target
def split_input_target(seq):
    input_text = seq[:-1]  # All words except the last
    target_text = seq[-1]  # The last word
    return input_text, target_text

dataset = dataset.map(split_input_target)

# Shuffle and group into batches
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)
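
Unlike the character-level pipeline, which cuts the text into non-overlapping chunks, this cell slides a window one word at a time and predicts a single target word per window. A plain-Python sketch on made-up indices follows; `sliding_windows` and `toy_indices` are illustrative names.

```python
def sliding_windows(indices, seq_length):
    """Return every run of seq_length + 1 consecutive items, advancing one item at a time."""
    return [indices[i - seq_length:i + 1] for i in range(seq_length, len(indices))]

toy_indices = [1, 2, 3, 4, 5, 6]
windows = sliding_windows(toy_indices, seq_length=3)

# Each window becomes (input words, single target word).
toy_pairs = [(w[:-1], w[-1]) for w in windows]
for inp, tgt in toy_pairs:
    print(inp, '->', tgt)

# One window per position after the first seq_length words.
print(len(windows), len(toy_indices) - 3)
```

The number of training examples is therefore `len(word_indices) - seq_length`, roughly one per word in the corpus, which is why the word-level model runs many more steps per epoch than the character-level one.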


**MODEL DEFINITION**

In [20]:
# Model parameters
embedding_dim = 256
rnn_units = 1024

# Model definition
class WordModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_sequences=False: only the output at the last time step is used
        self.gru = tf.keras.layers.GRU(rnn_units, return_sequences=False, return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = self.embedding(inputs, training=training)
        if states is None:
            # Start from a zero state when none is supplied
            batch_size = tf.shape(inputs)[0]
            states = tf.zeros((batch_size, self.gru.units))
        x, states = self.gru(x, initial_state=states, training=training)
        x = self.dense(x, training=training)  # Logits over the vocabulary
        if return_state:
            return x, states
        else:
            return x


# Instantiate the model
model = WordModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units
)


In [21]:
# Loss function
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile the model
model.compile(optimizer='adam', loss=loss)

# Checkpoint callback: saves the weights at the end of every epoch
checkpoint_dir = './word_training_checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=f"{checkpoint_dir}/ckpt_{{epoch:02d}}.weights.h5",
    save_weights_only=True
)

# Train the model
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])


Epoch 1/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 1036s 324ms/step - loss: 6.6276
Epoch 2/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 794s 249ms/step - loss: 5.5502
Epoch 3/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 740s 232ms/step - loss: 4.6793
Epoch 4/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 725s 228ms/step - loss: 3.4249
Epoch 5/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 722s 227ms/step - loss: 2.3503
Epoch 6/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 723s 227ms/step - loss: 1.6412
Epoch 7/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 723s 227ms/step - loss: 1.2025
Epoch 8/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 711s 223ms/step - loss: 0.9497
Epoch 9/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 707s 222ms/step - loss: 0.8287
Epoch 10/10
3188/3188 ━

**TEXT GENERATION**

In [32]:
def generate_text_word(model, seed_text, next_words, seq_length, temperature=1.0):
    for _ in range(next_words):
        # Convert the current text into a sequence of word indices
        token_list = tokenizer.texts_to_sequences([seed_text])[0]

        # Left-pad with zeros (or keep only the last seq_length tokens)
        token_list = tf.keras.preprocessing.sequence.pad_sequences(
            [token_list], maxlen=seq_length, padding='pre'
        )

        # Get the logits for the next word
        predictions = model.predict(token_list, verbose=0)
        predictions = predictions / temperature  # Scale by temperature

        # Pick the most likely word (greedy choice). Dividing by a positive
        # temperature does not change the argmax, so temperature has no effect here.
        predicted_word_index = np.argmax(predictions, axis=-1)[0]

        # Look up the corresponding word
        output_word = tokenizer.index_word.get(predicted_word_index, "<UNK>")

        # Append the generated word to the text
        seed_text += " " + output_word
    return seed_text
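
In the word-level function above, the logits are divided by the temperature and then passed to `np.argmax`. Dividing by a positive constant never changes which entry is largest, so every temperature selects the same word. The sketch below shows that invariance and, as one possible alternative, a sampling step that does respond to temperature. It is a standard-library illustration on made-up logits; `argmax`, `sample_index` and `toy_logits` are hypothetical names, and this is not the method used to produce the outputs in this notebook.

```python
import math
import random

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def sample_index(logits, temperature, rng):
    """Draw one index with probability proportional to exp(logit / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

toy_logits = [1.2, 3.4, 0.3, 2.9]

# Greedy choice: identical at every temperature.
greedy = {t: argmax([z / t for z in toy_logits]) for t in (0.5, 1.0, 1.5)}
print(greedy)

# Sampling: the spread of results depends on the temperature.
rng = random.Random(0)
draws = [sample_index(toy_logits, 1.5, rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With the real model, the analogous change would be to replace the `np.argmax` line with a draw from the temperature-scaled distribution, for example through `tf.random.categorical` as in the character-level generator.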




In [34]:
# Parameter configuration
temperaturas = [0.5, 1.0, 1.5]  # Temperatures to evaluate
longitudes = [5, 10, 20]  # Sequence lengths to try
seed_text = "To be"  # Seed text for every run
next_words = 50  # Number of words to generate


In [35]:
for seq_length in longitudes:
    print(f"\n--- Sequence length: {seq_length} ---\n")

    # Rebuild the training windows for this sequence length.
    # The model is not retrained here, so seq_length only changes
    # how the seed is padded or truncated during generation.
    tokenizer.fit_on_texts([text])
    word_indices = tokenizer.texts_to_sequences([text])[0]
    sequences = []
    for i in range(seq_length, len(word_indices)):
        seq = word_indices[i - seq_length:i + 1]
        sequences.append(seq)

    sequences = tf.constant(sequences)
    dataset = tf.data.Dataset.from_tensor_slices(sequences)
    dataset = dataset.map(split_input_target)
    dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)

    for temp in temperaturas:
        print(f"Temperature: {temp}")
        print("Generated text:\n")
        generated_text = generate_text_word(
            model=model,
            seed_text=seed_text,
            next_words=next_words,
            seq_length=seq_length,
            temperature=temp
        )
        print(generated_text)
        print("\n")



--- Sequence length: 5 ---

Temperature: 0.5
Generated text:

To be barren o o o prey sometime he's thy thought ready now they perceive best mercutio and fly alas witness sir now sits son and saving long prithee candle other they intend here true but they persuade noble service o they intend grace doing vantage now good marcius now be prompt


Temperature: 1.0
Generated text:

To be barren o o o prey sometime he's thy thought ready now they perceive best mercutio and fly alas witness sir now sits son and saving long prithee candle other they intend here true but they persuade noble service o they intend grace doing vantage now good marcius now be prompt


Temperature: 1.5
Generated text:

To be barren o o o prey sometime he's thy thought ready now they perceive best mercutio and fly alas witness sir now sits son and saving long prithee candle other they intend here true but they persuade noble service o they intend grace doing vantage now good marcius now be prompt



--- Sequen

Impact of temperature

Temperature = 0.5:
- Produces more conservative text with less variation.
- Tends to repeat similar phrases or patterns.

Temperature = 1.0:
- Balances coherence and creativity.
- Generates phrases with a more natural flow and a better balance between repetition and novelty.
- Drawback: may include phrases that are less connected.

Temperature = 1.5:
- Increases creativity but reduces coherence.

These effects show up in the character-level samples, which are drawn with tf.random.categorical. The word-level generator takes the argmax of the scaled logits, so its output for sequence length 5 is identical at all three temperatures.

Impact of sequence length

Sequence: 5 words
- The model has less context to generate text, which results in more disconnected phrases.

Sequence: 10 words
- Provides a reasonable balance between context and creativity.

Sequence: 20 words
- Offers the most context, which improves coherence and logical flow.
- May limit creativity, especially at low temperature.
