# Movie Review Classification
The premise is that each movie review can be classified as positive or negative based on its content.


## Loading the Dataset


Note that `num_words` keeps only that many of the most frequent words in the training dataset. The value was originally 10000; this notebook lowers it to 5000, as set in the next cell.
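
To illustrate what restricting the vocabulary to the most frequent words means, here is a minimal sketch on a toy corpus. It is not how Keras implements `num_words` internally; the helper `top_n_vocabulary` and the sample sentences are hypothetical.

```python
from collections import Counter

def top_n_vocabulary(texts, n):
    """Return the n most frequent words in texts, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(n)]

# Toy corpus (hypothetical data, not the IMDB dataset)
texts = ["the movie was great", "the movie was bad", "the plot was thin"]
vocab = top_n_vocabulary(texts, n=3)
print(vocab)
```

Words outside the retained vocabulary (here `great`, `bad`, `plot`, `thin`) would be treated as unknown when encoding a review.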


In [1]:
# Originally 10000, but the model could not finish training on the free tier of Colab
NUM_WORDS = 5000

In [3]:
from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=NUM_WORDS)


Because we keep only the `NUM_WORDS` most frequent words, no word index can reach `NUM_WORDS` (5000 here), so the largest index in the training and test sets is at most `NUM_WORDS - 1`.

In [4]:
max([max(sequence) for sequence in train_data])

4999

Now let's look at the first review in the training dataset and its classification.

In [5]:
print("Review: ", train_data[0])
print("Classification:", 'Positive' if train_labels[0] else 'Negative')

Review:  [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]
Classification: Positive


As you can see, the review is a list of numbers, that is, word indices. However, we can define a function to decode any review. (Keep in mind that less frequent words may be lost; they show up as `?` in the decoded text.)

In [6]:
# Load the dictionary that maps words to integer indices
word_index = imdb.get_word_index()

# Invert it to map integer indices back to words
reverse_word_index = {value: key for (key, value) in word_index.items()}

# Define a function to decode any review.
# Indices are offset by 3 because 0, 1 and 2 are reserved for
# `padding`, `start of sequence` and `unknown`
def decoded_review(review):
    return ' '.join(reverse_word_index.get(i - 3, '?') for i in review)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
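
To make the offset of 3 concrete, here is a minimal sketch with a toy vocabulary. The names `toy_word_index`, `toy_reverse_index` and `decode` are hypothetical; the real mapping comes from `imdb.get_word_index()`.

```python
# Toy word index (hypothetical); the real one comes from imdb.get_word_index()
toy_word_index = {"the": 1, "film": 2, "great": 3}
toy_reverse_index = {index: word for word, index in toy_word_index.items()}

def decode(review):
    """Map encoded indices back to words, undoing the offset of 3."""
    return " ".join(toy_reverse_index.get(i - 3, "?") for i in review)

# 1 marks the start of the sequence and 2 marks an unknown word
encoded = [1, 4, 5, 2, 6]
decoded = decode(encoded)
print(decoded)  # ? the film ? great
```

The reserved indices have no entry in the reversed dictionary, which is why they are rendered as `?`.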


In [7]:
decoded_review(train_data[2])

"? this has to be one of the worst films of the ? when my friends i were watching this film being the target audience it was aimed at we just sat watched the first half an hour with our jaws touching the floor at how bad it really was the rest of the time everyone else in the theatre just started talking to each other leaving or generally crying into their popcorn that they actually paid money they had ? working to watch this ? excuse for a film it must have looked like a great idea on paper but on film it looks like no one in the film has a clue what is going on crap acting crap costumes i can't get across how ? this is to watch save yourself an hour a bit of your life"

## Data Preparation

Since we cannot feed lists of integers into the neural network, we must first convert them into tensors. There are two ways to do this:

* Pad the lists so that they all have the same length, convert them into integer tensors of shape `(samples, word_indices)`, and use a first layer that can handle such tensors, that is, an `Embedding` layer.
* One-hot encode the lists, that is, turn each list into a vector of `NUM_WORDS` zeros and ones. If the review is `[1, 3]`, the encoded vector will be `[0, 1, 0, 1, 0, 0, ...]`.
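
The padding step can be sketched with NumPy as follows. This is a minimal sketch and not the approach this notebook takes (it uses the one-hot encoding instead); `pad_sequences_simple` is a hypothetical helper that pads and truncates at the front of each sequence.

```python
import numpy as np

def pad_sequences_simple(sequences, maxlen, pad_value=0):
    """Truncate from the front and pad on the left to shape (samples, maxlen)."""
    padded = np.full((len(sequences), maxlen), pad_value, dtype="int64")
    for i, sequence in enumerate(sequences):
        trimmed = sequence[-maxlen:]  # keep the last maxlen indices
        if trimmed:
            padded[i, -len(trimmed):] = trimmed
    return padded

# Toy encoded reviews of different lengths (hypothetical data)
sequences = [[1, 14, 22], [1, 7], [1, 9, 8, 5, 3]]
padded = pad_sequences_simple(sequences, maxlen=4)
print(padded)
```

The result is an integer tensor of shape `(3, 4)`, which is the kind of input an `Embedding` layer expects.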

In [8]:
import numpy as np

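def vectorize_sequences(sequences, dimension=NUM_WORDS):
    """Multi-hot encode each sequence into a vector of length `dimension`."""
    # Start from an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set to 1 the positions of the word indices present in this review
        results[i, sequence] = 1.
    return results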

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

As you can see, `vectorize_sequences` implements the one-hot encoding described above: each review becomes a fixed-length vector of `NUM_WORDS` zeros and ones.

In [21]:
x_train

array([[0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       ...,
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.],
       [0., 1., 1., ..., 0., 0., 0.]])
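
The full matrix is too large to inspect by eye, so here is a small worked example of the same encoding. It is a sketch that repeats the function with an explicit `dimension` so it runs on its own; the two toy reviews are hypothetical.

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    # One row per review, one column per word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

# The review [1, 3] from the text, plus one with a repeated index
toy = vectorize_sequences([[1, 3], [0, 2, 2]], dimension=6)
print(toy)
```

The first row is `[0, 1, 0, 1, 0, 0]`. The repeated index `2` in the second review still yields a single `1`, so this encoding discards both word order and word counts.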

Now we also vectorize the labels, converting them to `float32` NumPy arrays.

In [10]:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

In [23]:
y_train

array([1., 0., 0., ..., 0., 1., 0.], dtype=float32)

In [12]:
# Import the classifier and the evaluation report
from sklearn.svm import SVC
from sklearn.metrics import classification_report

In [15]:
SVM = SVC()
SVM.fit(x_train, y_train)

# Predict the sentiment of the samples in x_test and report the metrics
predicted_SVM = SVM.predict(x_test)
print(classification_report(y_test, predicted_SVM))

              precision    recall  f1-score   support

         0.0       0.89      0.87      0.88     12500
         1.0       0.88      0.90      0.89     12500

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000



In [18]:
import joblib
joblib.dump(SVM, 'saved_model2.pkl')


['saved_model2.pkl']
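
A saved model can be restored with `joblib.load`. The sketch below checks the round trip on a tiny stand-in classifier; the data, the temporary path and the file name `toy_model.pkl` are hypothetical and unrelated to `saved_model2.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVC

# Tiny stand-in data (hypothetical), shaped like multi-hot review vectors
X = np.array([[0., 1., 0.], [1., 0., 0.], [0., 1., 1.], [1., 0., 1.]])
y = np.array([1., 0., 1., 0.])
model = SVC().fit(X, y)

# Save the fitted model, then load it back from disk
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

It is generally safest to load a pickled model with the same scikit-learn version that was used to save it.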

In [36]:
input_phrase = 'I hated that movie'
# Encode like the dataset: 1 = start, known words offset by 3, out-of-vocabulary words = 2
encoded_input = [1] + [word_index[w] + 3 if word_index.get(w, NUM_WORDS) + 3 < NUM_WORDS else 2
                       for w in input_phrase.lower().split()]
vectorized_input = vectorize_sequences([encoded_input])
predicted_sentiment = SVM.predict(vectorized_input)
print(predicted_sentiment)


At this point, the data is ready to be fed into the neural network.
