# Guided Workshop

## Text classification with a FFNN

In this guided workshop we will learn to train a simple feed-forward neural network (FFNN) to solve a text classification problem. To do so, we will work with a relatively small dataset containing the plot summaries of a set of movies and the category each movie belongs to. The dataset has 27 labels, so this is a multiclass classification problem.

In [1]:
import pandas as pd

In [2]:
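# The ":::" separator is longer than one character, so pandas needs the Python parsing engine
df = pd.read_csv("train_data.txt", sep=":::", header=None, engine="python")
df.columns = ["id", "titulo", "categoria", "plot"]
df.drop(columns=["id", "titulo"], inplace=True)
# Note: Series.replace only swaps cells whose entire value is "'", so apostrophes inside the plots are kept
df["plot"] = df["plot"].str.strip().replace("'","")
df["categoria"] = df["categoria"].str.strip()
df.head()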


| | categoria | plot |
| --- | --- | --- |
| 0 | drama | Listening in to a conversation between his doc... |
| 1 | thriller | A brother and sister with a past incestuous re... |
| 2 | adult | As the bus empties the students for their fiel... |
| 3 | drama | To help their unemployed father make ends meet... |
| 4 | drama | The film's title refers not only to the un-rec... |


We verify that there are 27 categories:

In [3]:
df["categoria"].nunique()

27

The first step is to tokenize the text and build a dictionary of words, which forms the vocabulary of the corpus. Each text in the corpus is then transformed into a sequence of words, each represented by its index in the dictionary.

In [5]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=None)
tokenizer.fit_on_texts(df["plot"])
sequences = tokenizer.texts_to_sequences(df["plot"])
word_index = tokenizer.word_index

To understand what this step does, look at the content of the first row of the dataset, the sequence generated from it, and the meaning of each number in the sequence.

In [6]:
df.iloc[0]["plot"]

'Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue.'

In [7]:
sequences[0][:10]

[5241, 6, 5, 3, 2020, 96, 8, 526, 2, 231]

In [8]:
secuencia = sequences[0][:10]
for idx in secuencia:
    # index_word is the inverse of word_index: it maps each integer back to its word
    print(f"{idx}: {tokenizer.index_word[idx]}")

5241: listening
6: in
5: to
3: a
2020: conversation
96: between
8: his
526: doctor
2: and
231: parents
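
The mapping above can be imitated by hand. The following is a minimal sketch in plain Python, not the Keras implementation: it assumes that words are ranked by descending frequency starting at 1, and it uses a simplified regular expression for splitting. The names `build_word_index`, `to_sequence`, `toy_corpus` and `toy_index` are hypothetical and exist only for this illustration.

```python
from collections import Counter
import re

def build_word_index(texts):
    """Map each word to an integer rank: 1 = most frequent word in the corpus."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on anything that is not a letter, digit or apostrophe
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Replace every known word by its index, dropping unknown words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [word_index[t] for t in tokens if t in word_index]

toy_corpus = ["the doctor and the parents", "the parents and the lady"]
toy_index = build_word_index(toy_corpus)
print(toy_index)
print(to_sequence("The lady and the doctor", toy_index))  # [1, 5, 2, 1, 4]
```

As in the notebook output, very frequent words such as "the" or "and" receive the smallest indices, while rare words receive large ones.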


Because the input layer must have a fixed size, every text must contribute the same number of words. We use pad_sequences for this, which handles two cases: a text shorter than the maximum length is padded with zeros (at the start, by default), and a longer text is truncated (here at the end, because truncating="post").

In [9]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder

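max_length = 70
encoder = LabelEncoder()
# Pad shorter texts with leading zeros; cut longer texts after the first 70 tokens
X = pad_sequences(sequences, maxlen=max_length, truncating="post")
# Shift every index down by one. Caveat: the padding value 0 becomes -1, which is
# outside the valid index range of an Embedding layer; dropping this line keeps 0 for padding
X = X - 1
# Encode the category names as integers from 0 to 26
y = encoder.fit_transform(df["categoria"])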
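
The two padding cases can be sketched without Keras. The function below is a simplified stand-in written for this explanation, not the library code: `pad_or_truncate` is a hypothetical name, and it assumes zeros are added on the left (the pad_sequences default) while the `truncating` argument decides which end is cut.

```python
def pad_or_truncate(seq, maxlen, truncating="post", value=0):
    """Return a list of exactly `maxlen` items."""
    if len(seq) > maxlen:
        # Too long: keep the first maxlen items ("post") or the last maxlen items ("pre")
        return list(seq[:maxlen]) if truncating == "post" else list(seq[-maxlen:])
    # Too short (or exact): fill the left side with `value`
    return [value] * (maxlen - len(seq)) + list(seq)

print(pad_or_truncate([7, 8, 9], maxlen=5))              # [0, 0, 7, 8, 9]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [1, 2, 3, 4, 5]
```

With max_length = 70, a plot of 40 tokens would receive 30 leading zeros, and a plot of 120 tokens would keep only its first 70.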

In [10]:
df.iloc[0]["plot"]

'Listening in to a conversation between his doctor and parents, 10-year-old Oscar learns what nobody has the courage to tell him. He only has a few weeks to live. Furious, he refuses to speak to anyone except straight-talking Rose, the lady in pink he meets on the hospital stairs. As Christmas approaches, Rose uses her fantastical experiences as a professional wrestler, her imagination, wit and charm to allow Oscar to live life and love to the full, in the company of his friends Pop Corn, Einstein, Bacon and childhood sweetheart Peggy Blue.'

In [11]:
sequences[0][:10]

[5241, 6, 5, 3, 2020, 96, 8, 526, 2, 231]

In [12]:
secuencia = X[0]
for idx in secuencia:
    # X stores every index shifted down by one, so add 1 back before looking up the word
    word = tokenizer.index_word.get(int(idx) + 1)
    if word is not None:
        print(f"{idx}: {word}")

5240: listening
5: in
4: to
2: a
2019: conversation
95: between
7: his
525: doctor
1: and
230: parents
647: 10
90: year
61: old
2633: oscar
439: learns
54: what
2300: nobody
25: has
0: the
1462: courage
4: to
456: tell
31: him
10: he
58: only
25: has
2: a
291: few
1265: weeks
4: to
126: live
4442: furious
10: he
943: refuses
4: to
1469: speak
4: to
868: anyone
1883: except
1609: straight
1463: talking
1441: rose
0: the
800: lady
5: in
5090: pink
10: he
192: meets
13: on
0: the
567: hospital
8797: stairs
14: as
865: christmas
2849: approaches
1441: rose
882: uses
9: her
10606: fantastical
590: experiences
14: as
2: a
722: professional
9183: wrestler
9: her
1921: imagination
5053: wit
1: and
3688: charm
4: to


We split the dataset into training and test sets.

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

In [14]:
X_train.shape

(43371, 70)

We define the architecture of the neural network. It starts with an embedding layer, which looks up the embedding vector of each word index it receives from the sequence.

In [15]:
import tensorflow as tf
from tensorflow import keras

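model = keras.Sequential([
    # Declare the fixed input length here (input_length is deprecated in recent Keras versions)
    keras.Input(shape=(max_length,)),
    # One 10-dimensional vector per word index; +1 because the Tokenizer starts counting at 1
    keras.layers.Embedding(input_dim=len(word_index) + 1, output_dim=10),
    # Concatenate the 70 word vectors into a single vector of 700 values
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation='relu'),
    # keras.layers.Dropout(0.3),
    # One output probability per category
    keras.layers.Dense(27, activation='softmax')
])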
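
To make the embedding and flatten steps concrete, here is a minimal sketch in plain Python rather than Keras. The tiny sizes, the random table and the names `embedding_table`, `sequence`, `embedded` and `flattened` are assumptions chosen for illustration; in the real model the table values are learned during training.

```python
import random

random.seed(0)
vocab_size, embedding_dim, seq_len = 6, 3, 4

# One row of `embedding_dim` numbers per word index (random here, learned in a real model)
embedding_table = [[round(random.uniform(-1, 1), 2) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

sequence = [0, 5, 2, 2]                                     # a padded sequence of word indices
embedded = [embedding_table[i] for i in sequence]           # seq_len rows of embedding_dim values
flattened = [value for row in embedded for value in row]    # seq_len * embedding_dim values

print(len(embedded), len(embedded[0]))   # 4 3
print(len(flattened))                    # 12
```

The same arithmetic applies to the notebook's model: 70 positions with 10 values each give the 700 inputs that the first Dense layer receives after Flatten. Repeated indices, such as the two 2s above, always map to the same vector.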



We compile the neural network with the sparse_categorical_crossentropy loss, which suits a multiclass problem whose labels are integer class indices, as produced by LabelEncoder.

In [16]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, verbose=1, validation_data=(X_test, y_test))

Epoch 1/20
1356/1356 ━━━━━━━━━━━━━━━━━━━━ 23s 16ms/step - accuracy: 0.3337 - loss: 2.4010 - val_accuracy: 0.4707 - val_loss: 1.8531
Epoch 2/20
1356/1356 ━━━━━━━━━━━━━━━━━━━━ 22s 16ms/step - accuracy: 0.5304 - loss: 1.6151 - val_accuracy: 0.5044 - val_loss: 1.7240
Epoch 3/20
1001/1356 ━━━━━━━━━━━━━━━━━━━━ 5s 15ms/step - accuracy: 0.6766 - loss: 1.1615

KeyboardInterrupt: 

In [None]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Loss: {loss}, Accuracy: {accuracy}")

80/80 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.4438 - loss: 1.9294
Loss: 1.9546247720718384, Accuracy: 0.4391389489173889
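
To put the loss values in context, the sketch below computes the sparse categorical cross-entropy of a single example by hand: minus the natural logarithm of the probability assigned to the true class. It is a simplified illustration and not the Keras implementation, which averages this quantity over a batch; the function and the variables `uniform` and `confident` are hypothetical names.

```python
import math

def sparse_categorical_crossentropy(probabilities, true_class):
    """Loss for one example: minus the log of the probability given to the true class."""
    return -math.log(probabilities[true_class])

num_classes = 27

# A model that spreads its probability evenly over the 27 categories
uniform = [1.0 / num_classes] * num_classes
baseline = sparse_categorical_crossentropy(uniform, true_class=3)
print(round(baseline, 4))  # 3.2958

# A model that puts most of its probability on the true class
confident = [0.01] * num_classes
confident[3] = 1.0 - 0.01 * (num_classes - 1)
print(sparse_categorical_crossentropy(confident, true_class=3))
```

A loss near ln(27) ≈ 3.30 therefore corresponds to guessing uniformly, so the validation losses of about 1.85 and 1.72 in the training log indicate that the network has learned something beyond that baseline.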


## Recurrent networks for language models
In this section we will train a simple recurrent network (a plain RNN, not an LSTM) in order to build a basic neural language model.

In [17]:
dfS = pd.DataFrame({
    'textos': [
        "El aprendizaje automático es una rama de la inteligencia artificial.",
        "Las redes neuronales recurrentes son útiles para procesar texto.",
        "Podemos capturar dependencias secuenciales en el lenguaje con RNN."
    ]
})

In [18]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(dfS['textos'])
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(dfS['textos'])

# Build the training data
sequence_length = 5
X, y = [], []

for seq in sequences:
    for i in range(len(seq) - sequence_length):
        X.append(seq[i:i + sequence_length])
        y.append(seq[i + sequence_length])

X = np.array(X)
y = np.array(y)

This part is similar to the previous section, except for how the X and y arrays are built, since the task is no longer to classify a whole text. As the code shows, and as the cells below confirm, each group of sequence_length consecutive words is used to predict the word that follows it in the text:

In [19]:
X[0]

array([1, 2, 3, 4, 5])

In [20]:
y[0]

np.int64(6)

In [21]:
secuencia = np.append(X[0], y[0])
for idx in secuencia:
    # index_word maps each integer back to its word
    print(f"{idx}: {tokenizer.index_word[int(idx)]}")

1: el
2: aprendizaje
3: automático
4: es
5: una
6: rama


Likewise, we can see that the window advances one word at a time:

In [22]:
X[0]

array([1, 2, 3, 4, 5])

In [23]:
X[1]

array([2, 3, 4, 5, 6])
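
The windowing shown above can be isolated in a small helper. This is a sketch in plain Python that mirrors the loop of the notebook; the name `make_windows` and the stand-in sentence of ten consecutive integers are assumptions made for the example.

```python
def make_windows(token_ids, window):
    """Pair every run of `window` consecutive tokens with the token that follows it."""
    inputs, targets = [], []
    for start in range(len(token_ids) - window):
        inputs.append(token_ids[start:start + window])
        targets.append(token_ids[start + window])
    return inputs, targets

sentence = list(range(1, 11))          # stand-in for a 10-token sentence: 1, 2, ..., 10
inputs, targets = make_windows(sentence, window=5)
for x_row, y_value in zip(inputs, targets):
    print(x_row, "->", y_value)
```

A sentence with n tokens yields n - sequence_length pairs. The three sentences of the notebook have 10, 9 and 9 tokens, which gives 5 + 4 + 4 = 13 training examples; this is why the accuracy in the training log moves in steps of 1/13 ≈ 0.0769.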

We define the model:

In [24]:
model = keras.Sequential([
    keras.layers.Embedding(input_dim=len(word_index) + 1, output_dim=16),
    # We use a single simple RNN layer here; RNN layers can be stacked to increase capacity
    # keras.layers.SimpleRNN(64, return_sequences=True),
    keras.layers.SimpleRNN(64),
    keras.layers.Dense(len(word_index) + 1, activation='softmax')
])
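
What a simple recurrent layer computes at each position can be sketched in plain Python. This illustration assumes the usual recurrence h_t = tanh(W_x x_t + W_h h_(t-1) + b), with tanh being the default activation of SimpleRNN; the weights below are arbitrary numbers rather than trained values, and `rnn_step` is a hypothetical helper, not a Keras function.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrence step: h_t = tanh(W_x . x_t + W_h . h_prev + b)."""
    h_new = []
    for j in range(len(b)):
        total = b[j]
        total += sum(W_x[j][i] * x[i] for i in range(len(x)))
        total += sum(W_h[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

# Toy sizes: 2-dimensional word vectors and 3 hidden units (the notebook uses 16 and 64)
W_x = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
b = [0.0, 0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three embedded words
h = [0.0, 0.0, 0.0]                               # initial hidden state
for x in sequence:
    h = rnn_step(x, h, W_x, W_h, b)               # the same weights are reused at every step
print(h)  # final hidden state, the only output kept when return_sequences is False
```

Because the hidden state is carried from one word to the next, the final state depends on the order of the words, which is what lets the Dense layer on top predict the next word from it.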

In [26]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=50, verbose=1)

Epoch 1/50
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 4s/step - accuracy: 0.0000e+00 - loss: 3.3279
Epoch 2/50
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 67ms/step - accuracy: 0.0769 - loss: 3.3026
Epoch 3/50
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 53ms/step - accuracy: 0.0769 - loss: 3.2772
Epoch 4/50
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step - accuracy: 0.3846 - loss: 3.2514
Epoch 5/50
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 49ms/step - accuracy: 0.3846 - loss: 3.2251
Epoch 6/50
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step - accuracy: 0.5385 - loss: 3.1980
Epoch 7/50
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 54ms/step - accuracy: 0.5385 - loss: 3.1698
Epoch 8/50
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 56ms/step - accuracy: 0.6154 - loss: 3.1403
Epoch 9/50
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m 

<keras.src.callbacks.history.History at 0x7dd3fb9f0e10>

In [27]:
def generate_text(seed_text, next_words=1):
    for _ in range(next_words):
        tokenized = tokenizer.texts_to_sequences([seed_text])
        # Keep the last sequence_length tokens, left-padding with zeros if there are fewer
        tokenized_padded = pad_sequences(tokenized, maxlen=sequence_length)
        # argmax returns one entry per input row; take the single row as a plain int
        predicted_index = int(np.argmax(model.predict(tokenized_padded), axis=-1)[0])
        # Index 0 is padding and has no word, so fall back to an empty string
        predicted_word = tokenizer.index_word.get(predicted_index, "")
        seed_text += " " + predicted_word
    return seed_text

# Text generation example
print(generate_text("La inteligencia artificial es una"))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 272ms/step
La inteligencia artificial es una rama
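
The generation loop follows a greedy strategy: predict, pick the most likely index, append it, and repeat. The sketch below reproduces that control flow in plain Python with a made-up scoring function in place of the trained network, so `toy_predict`, `pad_left` and `generate_ids` are hypothetical names and the predictions carry no linguistic meaning.

```python
def pad_left(ids, maxlen):
    """Keep the last `maxlen` ids and left-pad with zeros, like the pad_sequences defaults."""
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

def toy_predict(window):
    """Hypothetical stand-in for model.predict: favours the id after the last one seen."""
    vocab_size = 8
    scores = [0.0] * vocab_size
    scores[(window[-1] % (vocab_size - 1)) + 1] = 1.0
    return scores

def generate_ids(seed_ids, next_words, window=5):
    ids = list(seed_ids)
    for _ in range(next_words):
        scores = toy_predict(pad_left(ids, window))
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax: greedy choice
        ids.append(best)                                        # feed the prediction back in
    return ids

print(generate_ids([1, 2, 3], next_words=4))  # [1, 2, 3, 4, 5, 6, 7]
```

Calling generate_text with next_words greater than 1 works the same way: each predicted word becomes part of the seed for the next prediction, and only the last sequence_length words are fed to the model.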
