<a href="https://colab.research.google.com/github/Nicolas-Noir/AA2-TUIA-2025/blob/main/Practica/AA2_lab9_a_28_04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 9 - Text classification with RNNs

The goal of this lab is to train a movie review classifier using a recurrent neural network (RNN). We use the IMDB dataset, which contains 50,000 movie reviews labeled as positive or negative.

## Environment setup

If we are not already inside the repo, clone it and cd into it. This lets us use the same notebook both locally and in Google Colab.

In [None]:
import os

REPO_NAME = "lab9"
if REPO_NAME not in os.getcwd():
  if not os.path.exists(REPO_NAME):
    !git clone https://github.com/FCEIA-AAII/{REPO_NAME}.git
  os.chdir(REPO_NAME)


Cloning into 'lab9'...
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 32 (delta 17), reused 15 (delta 5), pack-reused 0 (from 0)
Receiving objects: 100% (32/32), 36.73 KiB | 5.25 MiB/s, done.
Resolving deltas: 100% (17/17), done.


Import libraries:

In [None]:
import numpy as np
from pathlib import Path
import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt

Use the GPU by default if one is available.

In [None]:
# Configure TensorFlow to use the GPU when one is available
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Let TensorFlow allocate GPU memory on demand instead of reserving it all up front
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        # List the logical GPUs and report how many devices were found
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before the GPUs are initialized
        print(e)

1 Physical GPUs, 1 Logical GPUs


Load the dataset:

In [None]:
dataset, info = tfds.load('imdb_reviews', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']



Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


Inspect the dataset:

In [None]:
info

tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset. This is a dataset for binary sentiment
    classification containing substantially more data than previous benchmark
    datasets. We provide a set of 25,000 highly polar movie reviews for training,
    and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_dir='/root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',
    file_format=tfrecord,
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
        'text': Text(shape=(), dtype=string),
    }),
    supervised_keys=('text', 'label'),
    disable_shuffling=False,
    nondeterministic_order=False,
    splits={
        'test': <SplitInfo num_e

In [None]:
train_dataset.element_spec

(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

The first element is a tf.string holding the review text. The second element is a tf.int64 holding the review's label (0 for negative, 1 for positive).

Let's inspect an example:

In [None]:
for example, label in train_dataset.take(1):
  print('text: ', example.numpy())
  print('label: ', label.numpy())

text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
label:  0


Carry out the corresponding exploratory data analysis. For example, we can analyze:
  - Class distribution.
  - Review length.
  - Number of distinct words.
  - Least/most frequent words.
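
As a rough illustration of the statistics listed above, here is a minimal sketch that computes them on a tiny in-memory sample. The `sample` list and the `eda_summary` helper are hypothetical names introduced only for this example; in the notebook the pairs would come from iterating over `train_dataset` and decoding each review. Words are split on whitespace only, so punctuation stays attached to tokens.

```python
from collections import Counter

# Hypothetical stand-in for a few (text, label) pairs; in the notebook these
# would come from iterating over train_dataset and decoding each review.
sample = [
    ("A great movie with a great cast", 1),
    ("A terrible plot and terrible acting", 0),
    ("Great fun from start to finish", 1),
]

def eda_summary(pairs, top_k=3):
    """Compute basic exploratory statistics for (text, label) pairs."""
    class_counts = Counter(label for _, label in pairs)
    lengths = [len(text.split()) for text, _ in pairs]  # length in words
    vocab = Counter(word.lower() for text, _ in pairs for word in text.split())
    return {
        "class_counts": dict(class_counts),
        "mean_words": sum(lengths) / len(lengths),
        "vocab_size": len(vocab),
        "most_common": vocab.most_common(top_k),
        "least_common": vocab.most_common()[-top_k:],
    }

summary = eda_summary(sample)
print(summary)
```

Measuring length in words rather than characters is a choice made for this sketch; either unit works as long as it is reported alongside the number.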

In [None]:
# Implement the corresponding exploratory analysis.

In [None]:
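from collections import Counter

buenas = 0   # positive reviews (label 1)
malas = 0    # negative reviews (label 0)
palabras_distintas = Counter()  # frequency of each lowercased, whitespace-split token
largo_palabras = []             # review lengths, measured in characters

for example, label in train_dataset:
  # Label convention of the dataset: 0 = negative, 1 = positive.
  if label == 1:
    buenas += 1
  elif label == 0:
    malas += 1

  texto = example.numpy().decode('utf-8')
  largo_palabras.append(len(texto))
  palabras_distintas.update(palabra.lower() for palabra in texto.split())

# Tokens sorted by ascending and by descending frequency.
palabras_min = dict(sorted(palabras_distintas.items(), key=lambda item: item[1]))
palabras_max = dict(sorted(palabras_distintas.items(), key=lambda item: item[1], reverse=True))

print(buenas, malas)                               # reviews per class
print(sum(largo_palabras) / len(largo_palabras))   # mean review length in characters
print(len(palabras_distintas))                     # number of distinct tokens
print(dict(list(palabras_min.items())[-20:]))      # 20 most frequent tokens (tail of ascending sort)
print(dict(list(palabras_max.items())[-20:]))      # 20 of the least frequent tokens (tail of descending sort)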

12500 12500
1325.06964
251637
{'his': 29059, 'movie': 30887, 'on': 31619, 'but': 39764, 'with': 42729, 'for': 42843, 'as': 45102, 'was': 47024, '/><br': 50935, 'it': 65505, 'that': 66292, 'this': 69714, 'i': 70480, 'in': 90527, 'is': 104171, 'to': 133967, 'of': 144462, 'and': 158572, 'a': 159953, 'the': 322198}
{'positives-': 1, 'well-shot.': 1, '1/5,': 1, 'boo-ing': 1, '4-d': 1, 'spacewalk': 1, 'bit...well...trite.': 1, 'moonwalk.': 1, "'dogma'": 1, "end'.": 1, 'drusse?': 1, 'mayble': 1, 'riget3?': 1, 'puppet-driven': 1, "'watcher'": 1, 'dreams?<br': 1, 'ends...': 1, 'expect-': 1, '/>greets,': 1, 'slaj<br': 1}
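
The output above contains tokens such as `'/><br'` and `'well-shot.'`, which suggests that whitespace splitting leaves HTML line breaks and punctuation inside the vocabulary. One possible cleanup, shown here only as a sketch, removes `<br />` tags and keeps alphanumeric tokens with inner apostrophes or hyphens. The `tokenize` helper, both regular expressions, and the sample `review` string are assumptions made for this example, not part of the lab's code.

```python
import re
from collections import Counter

# Assumed patterns for this sketch: HTML line breaks, and word-like tokens
# that may contain inner apostrophes or hyphens (e.g. "don't", "well-shot").
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
TOKEN = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase, drop <br /> tags, and return word tokens without punctuation."""
    text = BR_TAG.sub(" ", text.lower())
    return TOKEN.findall(text)

# Made-up review text used only to exercise the helper.
review = "Well-shot.<br /><br />The ending? Don't ask..."
tokens = tokenize(review)
print(tokens)
print(Counter(tokens).most_common(2))
```

Swapping `texto.split()` for a tokenizer along these lines would likely shrink the 251,637 distinct tokens reported above, since variants like `'ends...'` and `'ends'` would collapse into one entry; the exact reduction would need to be measured.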
