<a href="https://colab.research.google.com/github/SolKidonakis/AA2TP2/blob/main/Ejercicio%202/AA2TP2EJ2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PRACTICAL ASSIGNMENT No. 2: MACHINE LEARNING 2**

**STUDENT: SOL KIDONAKIS**

**LIBRARIES**

In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, GRU
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dropout

In [None]:
# Configure the GPU if one is available
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)

# Install tensorflow 2.15
!pip install tensorflow==2.15.0

**LOADING THE DATASET**

In [2]:
# Download the dataset
path_to_file = tf.keras.utils.get_file(
    "shakespeare.txt",
    "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt"
)

# Read the contents (the context manager closes the file handle)
with open(path_to_file, 'rb') as f:
    text = f.read().decode(encoding='utf-8')
print(f"Text length: {len(text)} characters")


Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
Text length: 1115394 characters


In [3]:
# Show the first 500 characters
print(text[:500])


First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


In [4]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters in the text')


65 unique characters in the text


**PREPROCESSING FOR THE CHARACTER-LEVEL MODEL**

Text vectorization

In [5]:

# Layer that maps characters to integer IDs
ids_from_chars = tf.keras.layers.StringLookup(vocabulary=list(vocab), mask_token=None)

# Inverse layer that maps IDs back to characters
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

# Helper that joins a tensor of IDs back into a string
def text_from_ids(ids):
    return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
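
As a rough illustration of what the two lookup layers do, the following plain-Python sketch builds the same kind of character-to-ID table. It assumes, as `StringLookup` does with `mask_token=None` and its default single out-of-vocabulary slot, that index 0 is reserved for unknown characters. The helper names (`build_lookup`, `encode`, `decode`) are illustrative and not part of the notebook.

```python
# Plain-Python sketch of a character <-> ID lookup (illustrative helper names).
# Assumption: index 0 is reserved for unknown characters, mirroring StringLookup
# with mask_token=None and one out-of-vocabulary index.
UNK = "[UNK]"

def build_lookup(corpus):
    """Return (char_to_id, id_to_char) for the sorted unique characters."""
    chars = sorted(set(corpus))
    id_to_char = [UNK] + chars
    char_to_id = {ch: i for i, ch in enumerate(id_to_char)}
    return char_to_id, id_to_char

def encode(s, char_to_id):
    """Map each character to its ID, falling back to 0 for unseen characters."""
    return [char_to_id.get(ch, 0) for ch in s]

def decode(ids, id_to_char):
    """Map IDs back to characters and join them into one string."""
    return "".join(id_to_char[i] for i in ids)

sample = "First Citizen:"
char_to_id, id_to_char = build_lookup(sample)
ids = encode(sample, char_to_id)
print(ids)
print(decode(ids, id_to_char))
```

The reserved slot is why the lookup vocabulary is one entry larger than the set of unique characters: 65 characters in the text give 66 IDs.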


Split the text into input and target sequences


In [6]:
# Convert the text to IDs
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))

# Split into chunks of the desired length (one extra ID for the shifted target)
seq_length = 100
examples_per_epoch = len(text) // seq_length
sequences = tf.data.Dataset.from_tensor_slices(all_ids).batch(seq_length+1, drop_remainder=True)

# Split each chunk into input and target, shifted by one position
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

# Build the final dataset
dataset = sequences.map(split_input_target)

# Shuffle and batch
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
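
The chunk-then-shift step can be traced by hand on a short string. The sketch below is plain Python rather than `tf.data`, and the helper names `chunk` and `shift_by_one` are made up for the illustration. It also repeats the batch arithmetic for the sizes used in this notebook.

```python
# Sketch: chunk an ID stream into windows of seq_length + 1, then shift by one.
def chunk(ids, seq_length):
    """Non-overlapping windows of seq_length + 1 items, dropping the remainder."""
    size = seq_length + 1
    return [ids[i:i + size] for i in range(0, len(ids) - size + 1, size)]

def shift_by_one(window):
    """Input is everything but the last item; target is everything but the first."""
    return window[:-1], window[1:]

windows = chunk(list("Hello world"), seq_length=4)
pairs = [shift_by_one(w) for w in windows]
for inp, tgt in pairs:
    print("".join(inp), "->", "".join(tgt))

# Same arithmetic for the corpus size, sequence length and batch size used above.
n_chars, seq_length, batch_size = 1115394, 100, 64
n_windows = n_chars // (seq_length + 1)
n_batches = n_windows // batch_size
print(n_windows, n_batches)
```

At every position the target is simply the next character of the input, so one chunk of 101 IDs yields 100 training predictions.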


**CHARACTER-LEVEL MODEL DEFINITION**

In [7]:
# Define the model
class MyModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(rnn_units,
                                       return_sequences=True,
                                       return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = self.embedding(inputs, training=training)
        if states is None:
            states = self.gru.get_initial_state(x)
        x, states = self.gru(x, initial_state=states, training=training)
        x = self.dense(x, training=training)
        return (x, states) if return_state else x

# Model parameters
vocab_size = len(ids_from_chars.get_vocabulary())
embedding_dim = 256
rnn_units = 1024

# Create the model
model = MyModel(vocab_size=vocab_size, embedding_dim=embedding_dim, rnn_units=rnn_units)
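
To get a feel for the size of this architecture, the parameter count can be worked out by hand. The sketch below is a back-of-the-envelope calculation, not output from the notebook. It assumes the GRU keeps the Keras default `reset_after=True` (two bias vectors per gate), and the function name `count_params` is illustrative.

```python
# Back-of-the-envelope parameter count for Embedding -> GRU -> Dense.
# Assumption: the GRU uses the Keras default reset_after=True, which keeps two
# bias vectors per gate (hence the 2 * rnn_units term).
def count_params(vocab_size, embedding_dim, rnn_units):
    embedding = vocab_size * embedding_dim
    gru = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)
    dense = rnn_units * vocab_size + vocab_size
    return {"embedding": embedding, "gru": gru, "dense": dense,
            "total": embedding + gru + dense}

# 65 characters plus the [UNK] slot give a vocabulary of 66.
params = count_params(vocab_size=66, embedding_dim=256, rnn_units=1024)
for name, value in params.items():
    print(f"{name:>9}: {value:,}")
```

Nearly all of the weights sit in the recurrent layer; the embedding and output layers are small because the character vocabulary is small.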


**TRAINING THE MODEL**

In [8]:
# Loss function (the model outputs raw logits, hence from_logits=True)
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Compile the model
model.compile(optimizer='adam', loss=loss)

# Train the model
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
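
For reference, the loss used above can be computed by hand for a single prediction. The following is a minimal plain-Python sketch of sparse categorical cross-entropy from logits; the function is a stand-in written for this illustration, not the Keras implementation, and the logits are made up.

```python
import math

def sparse_categorical_crossentropy(label, logits):
    """Cross-entropy for one integer label given unnormalised logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# A model that has learned nothing assigns equal logits to all 66 IDs,
# so its loss equals ln(66).
uniform_loss = sparse_categorical_crossentropy(label=3, logits=[0.0] * 66)

# A confident and correct prediction gives a loss close to zero.
confident_loss = sparse_categorical_crossentropy(label=1, logits=[0.0, 5.0, 0.0])

print(f"uniform: {uniform_loss:.4f}, confident: {confident_loss:.4f}")
```

The uniform case gives a useful baseline: a per-character training loss well below ln(66) means the model is doing better than guessing.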


**TEXT GENERATION**

In [11]:
class OneStep(tf.keras.Model):
    def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
        super().__init__()
        self.temperature = temperature
        self.model = model
        self.chars_from_ids = chars_from_ids
        self.ids_from_chars = ids_from_chars

    @tf.function
    def generate_one_step(self, inputs, states=None):
        input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
        input_ids = self.ids_from_chars(input_chars).to_tensor()

        predicted_logits, states = self.model(inputs=input_ids, states=states, return_state=True)
        predicted_logits = predicted_logits[:, -1, :] / self.temperature
        predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
        predicted_ids = tf.squeeze(predicted_ids, axis=-1)

        predicted_chars = self.chars_from_ids(predicted_ids)
        return predicted_chars, states


In [22]:
# Seed prompts
start_strings = [
    "ROMEO: What light through yonder window breaks?",
    "JULIET: O Romeo, Romeo! wherefore art thou Romeo?",
    "HAMLET: To be or not to be, that is the question.",
    "MACBETH: Is this a dagger which I see before me?",
    "KING LEAR: Blow, winds, and crack your cheeks! Rage!"
]


In [23]:
def generate_for_start_strings(start_strings, temperatures=(0.5, 1.0, 1.5), sequence_lengths=(10, 20, 50), num_chars=300):
    """
    Generate text from seed prompts at several temperatures and prompt lengths.

    Args:
        start_strings (list[str]): Seed prompts.
        temperatures (Iterable[float]): Sampling temperatures to try.
        sequence_lengths (Iterable[int]): Number of leading prompt characters to keep.
        num_chars (int): Number of characters to generate.

    Returns:
        dict: Generated text keyed by prompt, temperature and prompt length.
    """
    # Build one OneStep wrapper per temperature from the trained character model.
    # generate_one_step is a tf.function, so the Python float self.temperature is
    # fixed when the function is first traced; reassigning the attribute later
    # would not change the sampling.
    one_step_models = {
        temp: OneStep(model, chars_from_ids, ids_from_chars, temperature=temp)
        for temp in temperatures
    }
    results = {}

    for start_string in start_strings:
        results[start_string] = {}

        for temp in temperatures:
            one_step_model = one_step_models[temp]
            results[start_string][f"Temperature {temp}"] = {}

            for length in sequence_lengths:
                truncated_start = start_string[:length]
                states = None
                next_char = tf.constant([truncated_start])
                result = [next_char]

                # Generate one character at a time, feeding the state back in
                for _ in range(num_chars):
                    next_char, states = one_step_model.generate_one_step(next_char, states=states)
                    result.append(next_char)

                # Store the generated text
                generated_text = tf.strings.join(result)[0].numpy().decode('utf-8')
                results[start_string][f"Temperature {temp}"][f"Length {length}"] = generated_text

    return results


In [24]:

temperatures = [0.5, 1.0, 1.5]  # Temperature values to try
sequence_lengths = [10, 20, 50]  # Prompt lengths to try
num_chars = 300  # Number of characters to generate


results = generate_for_start_strings(start_strings, temperatures=temperatures, sequence_lengths=sequence_lengths, num_chars=num_chars)


In [25]:
# Show the generated results
for start_string, temp_results in results.items():
    print(f"\n### Seed prompt: {start_string} ###")
    for temp, length_results in temp_results.items():
        print(f"\n--- {temp} ---")
        # `generated` avoids overwriting the global `text` that holds the corpus
        for length, generated in length_results.items():
            print(f"\nInitial length {length}:\n{generated[:500]}")  # Show at most the first 500 characters



### Seed prompt: ROMEO: What light through yonder window breaks? ###

--- Temperature 0.5 ---

Initial length Length 10:
ROMEO: What is the part?

ROMEO:
The more I shall not see him that you have as death.

DUCHESS OF YORK:
Why, he shall die, the triumphs of the deep was state-bed,
And here is sweat as you, the world death is come to go.

BENVOLIO:
The foot are such a maid we heard of men,
The senate arms against the death, th

Initial length Length 20:
ROMEO: What light the children less the heart of season
That I mean to see the measure of her very trimm
That you should beget the first hath seld me in:
And let him seek to go about her love.
O, we must need to me what I less you,
That we were deadly looks on him: the house
What counsel shall dear love, who are sent f

Initial length Length 50:
ROMEO: What light through yonder window breaks?

SICINIUS:
Well, sir, you shall not say
When I am not been since to curse your head,
And trust my heart as the deep from men
That sh

Temperature 0.5: best for stylistic coherence, but less creative.
Temperature 1.0: best balance between coherence and creativity.
Temperature 1.5: more creative, but less reliable.

Initial length: longer prompts (20-50 characters) produce text that stays closer to the seed prompt and improve thematic coherence.

For results closest to Shakespeare's style, temperature 1.0 with prompts of 20 to 50 characters works best.
For more creative experiments, use temperature 1.5, at the cost of some coherence.
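
The effect of temperature described above follows from how the logits are rescaled before sampling: `tf.random.categorical` draws from the softmax of whatever logits it is given, so dividing them by the temperature sharpens or flattens that distribution. The sketch below shows this in plain Python on made-up logits; the function name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]  # made-up scores for four candidate characters
distributions = {t: softmax_with_temperature(logits, t) for t in (0.5, 1.0, 1.5)}
for t, probs in distributions.items():
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the top-scoring character, while a temperature above 1 spreads it over the alternatives, which is where the extra variety and the extra errors both come from.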

**PREPROCESSING FOR THE WORD-LEVEL MODEL**

In [None]:
# Word-level tokenization
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])  # Build the vocabulary from the text

# Convert the text into a sequence of word indices
word_indices = tokenizer.texts_to_sequences([text])[0]
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved
print(f"Vocab size: {vocab_size}")

Vocab size: 12633
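
The following plain-Python sketch approximates what the tokenizer does with its default settings: lowercase the text, replace punctuation with spaces, split on whitespace, and number the words from 1 in order of decreasing frequency. The filter string is assumed to match the Keras default, and `fit_word_index` is an illustrative helper, not a Keras function.

```python
from collections import Counter

# Assumed to match the Keras Tokenizer default filter set. The apostrophe is
# not filtered, which is why tokens such as "know't" stay in one piece.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def fit_word_index(corpus):
    """Return the token list and a word -> index map (most frequent word = 1)."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    words = corpus.lower().translate(table).split()
    counts = Counter(words)
    # Index 0 is left unused so it can serve as the padding value.
    word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    return words, word_index

words, word_index = fit_word_index("Speak, speak.\nWe know't, we know't.")
sequence = [word_index[w] for w in words]
print(word_index)
print(sequence)
```

Because indices start at 1, the model's vocabulary size is the number of distinct words plus one.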


In [None]:
# Build the training sequences
seq_length = 10
sequences = []

for i in range(seq_length, len(word_indices)):
    seq = word_indices[i - seq_length:i + 1]  # input sequence + target word
    sequences.append(seq)

# Convert to a TensorFlow Dataset
sequences = tf.constant(sequences)
dataset = tf.data.Dataset.from_tensor_slices(sequences)

# Split into input and target
def split_input_target(seq):
    input_text = seq[:-1]  # Every word except the last
    target_text = seq[-1]  # The last word
    return input_text, target_text

dataset = dataset.map(split_input_target)

# Shuffle and batch
BATCH_SIZE = 64
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
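
Unlike the character-level pipeline, which cuts the text into non-overlapping chunks, this loop slides a window one word at a time and predicts only the word that follows each window. A small plain-Python trace on made-up indices (the helper name `sliding_windows` is illustrative):

```python
def sliding_windows(indices, seq_length):
    """Overlapping windows: seq_length input indices followed by one target."""
    pairs = []
    for i in range(seq_length, len(indices)):
        window = indices[i - seq_length:i + 1]
        pairs.append((window[:-1], window[-1]))
    return pairs

toy = [11, 12, 13, 14, 15, 16]  # made-up word indices
pairs = sliding_windows(toy, seq_length=3)
for inputs, target in pairs:
    print(inputs, "->", target)
```

The number of examples is `len(indices) - seq_length`, so the word-level dataset has roughly one example per word in the corpus.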


**MODEL DEFINITION**

In [None]:
# Model parameters
embedding_dim = 256
rnn_units = 1024

# Build the model
class WordModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # return_sequences=False: only the output at the last time step is kept
        self.gru = tf.keras.layers.GRU(rnn_units, return_sequences=False, return_state=True)

        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, inputs, states=None, return_state=False, training=False):
        x = self.embedding(inputs, training=training)
        if states is None:
            # Start from an all-zero hidden state
            batch_size = tf.shape(inputs)[0]
            states = tf.zeros((batch_size, self.gru.units))
        x, states = self.gru(x, initial_state=states, training=training)
        x = self.dense(x, training=training)  # Logits over the vocabulary
        if return_state:
            return x, states
        else:
            return x


# Instantiate the model
model = WordModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units
)


In [None]:
import os

# Loss function
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile the model
model.compile(optimizer='adam', loss=loss)

# Checkpoint callback: save the weights at the end of every epoch
checkpoint_dir = './word_training_checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=f"{checkpoint_dir}/ckpt_{{epoch:02d}}.weights.h5",
    save_weights_only=True
)

# Train the model
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])


Epoch 1/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 1036s 324ms/step - loss: 6.6276
Epoch 2/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 794s 249ms/step - loss: 5.5502
Epoch 3/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 740s 232ms/step - loss: 4.6793
Epoch 4/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 725s 228ms/step - loss: 3.4249
Epoch 5/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 722s 227ms/step - loss: 2.3503
Epoch 6/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 723s 227ms/step - loss: 1.6412
Epoch 7/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 723s 227ms/step - loss: 1.2025
Epoch 8/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 711s 223ms/step - loss: 0.9497
Epoch 9/10
3188/3188 ━━━━━━━━━━━━━━━━━━━━ 707s 222ms/step - loss: 0.8287
Epoch 10/10
3188/3188 ━
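
One way to read these numbers is to convert the cross-entropy loss (in nats) into perplexity, `exp(loss)`, and compare it with the loss of a uniform guess over the 12633-word vocabulary. The sketch below only redoes that arithmetic on a few values copied from the log; it is not part of the original run.

```python
import math

vocab_size = 12633
losses = {1: 6.6276, 5: 2.3503, 9: 0.8287}  # training losses copied from the log

# Loss of a model that spreads probability evenly over the whole vocabulary.
uniform_loss = math.log(vocab_size)

# Perplexity: the effective number of equally likely words the model hesitates between.
perplexities = {epoch: math.exp(loss) for epoch, loss in losses.items()}

print(f"uniform-guess loss: {uniform_loss:.2f}")
for epoch, loss in losses.items():
    print(f"epoch {epoch}: loss {loss:.4f} -> perplexity {perplexities[epoch]:.1f}")
```

These are training losses with no held-out split, so a very low perplexity here may reflect memorisation of the training windows rather than better generalisation.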

**TEXT GENERATION**

In [None]:
def generate_text_word(model, seed_text, next_words, seq_length, temperature=1.0):
    for _ in range(next_words):
        # Convert the seed text into a sequence of word indices
        token_list = tokenizer.texts_to_sequences([seed_text])[0]

        # Pad or truncate on the left so the input has exactly seq_length tokens
        token_list = tf.keras.preprocessing.sequence.pad_sequences(
            [token_list], maxlen=seq_length, padding='pre'
        )

        # Predict the logits for the next word
        predictions = model.predict(token_list, verbose=0)
        predictions = predictions / temperature  # Scale the logits by the temperature

        # Pick the highest-scoring word (greedy decoding); scaling the logits by a
        # positive temperature does not change which index is the argmax
        predicted_word_index = np.argmax(predictions, axis=-1)[0]

        # Look up the corresponding word
        output_word = tokenizer.index_word.get(predicted_word_index, "<UNK>")

        # Append the generated word to the seed text
        seed_text += " " + output_word
    return seed_text
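
Because the function above takes the argmax, the temperature cannot change its choice: dividing every logit by the same positive number preserves their order. For temperature to matter, the next word has to be sampled, as the character-level model does with `tf.random.categorical`. The sketch below contrasts the two in plain Python on made-up logits. It is an illustration of the idea under those assumptions, not a drop-in replacement for `generate_text_word`, and the helper names are hypothetical.

```python
import math
import random

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def sample_with_temperature(logits, temperature, rng):
    """Draw one index with probability proportional to exp(logit / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.2, 0.9, 0.3, -0.5]  # made-up scores for four candidate words
rng = random.Random(0)          # fixed seed so the run is repeatable
greedy_choices = {}
sampled_choices = {}
for t in (0.5, 1.0, 1.5):
    greedy_choices[t] = argmax([z / t for z in logits])
    sampled_choices[t] = [sample_with_temperature(logits, t, rng) for _ in range(10)]
    print(f"T={t}: greedy={greedy_choices[t]}, sampled={sampled_choices[t]}")
```

The greedy choice is the same at every temperature, whereas the sampled indices depend on it.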




In [None]:
# Parameter configuration
temperaturas = [0.5, 1.0, 1.5]  # Temperature values to evaluate
longitudes = [5, 10, 20]  # Sequence lengths to try
seed_text = "To be"  # Seed text for every run
next_words = 50  # Number of words to generate


In [None]:
for seq_length in longitudes:
    print(f"\n--- Sequence length: {seq_length} ---\n")

    # Rebuild the preprocessing for this sequence length. The model is not
    # retrained inside this loop, so at generation time seq_length only sets
    # how many (left-padded) seed tokens are fed to the model.
    tokenizer.fit_on_texts([text])
    word_indices = tokenizer.texts_to_sequences([text])[0]
    sequences = []
    for i in range(seq_length, len(word_indices)):
        seq = word_indices[i - seq_length:i + 1]
        sequences.append(seq)

    sequences = tf.constant(sequences)
    dataset = tf.data.Dataset.from_tensor_slices(sequences)
    dataset = dataset.map(split_input_target)
    dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

    for temp in temperaturas:
        print(f"Temperature: {temp}")
        print("Generated text:\n")
        generated_text = generate_text_word(
            model=model,
            seed_text=seed_text,
            next_words=next_words,
            seq_length=seq_length,
            temperature=temp
        )
        print(generated_text)
        print("\n")



--- Sequence length: 5 ---

Temperature: 0.5
Generated text:

To be barren o o o prey sometime he's thy thought ready now they perceive best mercutio and fly alas witness sir now sits son and saving long prithee candle other they intend here true but they persuade noble service o they intend grace doing vantage now good marcius now be prompt


Temperature: 1.0
Generated text:

To be barren o o o prey sometime he's thy thought ready now they perceive best mercutio and fly alas witness sir now sits son and saving long prithee candle other they intend here true but they persuade noble service o they intend grace doing vantage now good marcius now be prompt


Temperature: 1.5
Generated text:

To be barren o o o prey sometime he's thy thought ready now they perceive best mercutio and fly alas witness sir now sits son and saving long prithee candle other they intend here true but they persuade noble service o they intend grace doing vantage now good marcius now be prompt



--- Longit

Impact of temperature

Temperature = 0.5:

Produces more conservative text with less creative variation.
Tends to repeat similar phrases or patterns.

Temperature = 1.0:

Provides a balance between coherence and creativity.
Generates phrases with a more natural flow and a better balance between repetition and novelty.

Drawback: may include loosely connected phrases.

Temperature = 1.5:

Increases creativity but reduces coherence.



Impact of sequence length

Sequence: 5 words

The model has less context to generate from, which results in more disconnected phrases.


Sequence: 10 words

Provides a reasonable balance between context and creativity.


Sequence: 20 words

Offers the most context, which improves coherence and logical flow.
May limit creativity, especially at low temperature.
