Given the following procedure for classifying sentiment with the imdb dataset:
1. Run the procedure and compare the values of the variables accuracy_lstm and accuracy_cnn_lstm.
2. Replicate the procedure on the dataset sent as an attachment (Big_AHR.csv.zip) to build a sentiment classifier for Spanish.
3. Compare and show the results obtained with LSTM alone and with CNN + LSTM for your Spanish classifier.


(*) If execution fails for lack of resources, you may create a subset of the Big_AHR.csv.zip file.

(*) Use the following links as reference.

1. https://github.com/anandsarank/cnn-lstm-text-classification/blob/main/CNN%20with%20LSTM%20for%20Text%20Classification.ipynb
2. https://colab.research.google.com/github/alvinntnu/python-notes/blob/master/nlp/sentiment-analysis-lstm-v1.ipynb
3. https://www.kaggle.com/code/chizhikchi/lstm-binary-sentiment-classification-for-spanish/notebook
4. https://www.kaggle.com/code/chizhikchi/ahr-corpus-presentation


In [None]:
import warnings

import numpy as np
from prettytable import PrettyTable
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.datasets import imdb
from tensorflow.keras.layers import Conv1D, Dense, Dropout, Embedding, LSTM, MaxPooling1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing import sequence

# Seed NumPy's global RNG (used by train_test_split when random_state is not set)
np.random.seed(7)
# Silence library warnings in the notebook output
warnings.filterwarnings('ignore')

In [None]:
# load the dataset but only keep the top n words; rarer words are mapped to the out-of-vocabulary index (2 by default)
top_words = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
X_train,X_cv,y_train,y_cv = train_test_split(X_train,y_train,test_size = 0.2)
print("Shape of train data:", X_train.shape)
print("Shape of Test data:", X_test.shape)
print("Shape of CV data:", X_cv.shape)

# truncate and pad input sequences
max_review_length = 600
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
X_cv = sequence.pad_sequences(X_cv,maxlen=max_review_length)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Shape of train data: (20000,)
Shape of Test data: (25000,)
Shape of CV data: (5000,)
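
To make the padding step concrete, here is a minimal pure-Python sketch of what `sequence.pad_sequences` does to a single sequence with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). The helper name `pad_pre` and the toy reviews are made up for this illustration; they are not part of Keras or of the notebook.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults for one sequence:
    keep the LAST maxlen tokens and left-pad shorter sequences with `value`."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    seq = list(seq)[-maxlen:]                    # 'pre' truncation drops tokens from the front
    return [value] * (maxlen - len(seq)) + seq   # 'pre' padding adds zeros at the front


short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

print(pad_pre(short_review, maxlen=5))  # padded on the left
print(pad_pre(long_review, maxlen=5))   # beginning of the review is cut off
```

With `max_review_length = 600`, this means reviews longer than 600 tokens lose their beginning rather than their end.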


# Using the IMDB dataset

## LSTM

In [None]:
# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
filepath="weights_best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max',save_weights_only=True)
callbacks_list = [checkpoint]
model.fit(X_train, y_train, epochs=5, batch_size=256,verbose = 1,callbacks = callbacks_list,validation_data=(X_cv,y_cv))

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 600, 32)           320000    
                                                                 
 lstm (LSTM)                 (None, 100)               53200     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 373,301
Trainable params: 373,301
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
Epoch 1: val_accuracy improved from -inf to 0.83200, saving model to weights_best.hdf5
Epoch 2/5
Epoch 2: val_accuracy improved from 0.83200 to 0.86080, saving model to weights_best.hdf5
Epoch 3/5
Epoch 3: val_accuracy improved from 0.86080 to 0.88000, saving model to weights_best.hdf5
Epoch 4/5
Ep

<keras.callbacks.History at 0x79421788be20>

In [None]:
# Final evaluation of the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.load_weights("weights_best.hdf5")
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
scores = model.evaluate(X_test, y_test, verbose=1,batch_size = 256)
accuracy_lstm = scores[1]*100
print("Accuracy using LSTM: %.2f%%" % (accuracy_lstm))

Accuracy using LSTM: 87.30%
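
The parameter counts printed in the LSTM model summary can be reproduced by hand, which is a quick sanity check that the architecture is what we expect. The sketch below uses the standard formulas for Embedding, LSTM and Dense layers; the helper function names are introduced here for illustration only.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


emb = embedding_params(10000, 32)   # top_words x embedding size
lstm = lstm_params(32, 100)         # embedding size -> 100 LSTM units
dense = dense_params(100, 1)        # 100 LSTM outputs -> 1 sigmoid unit
total = emb + lstm + dense

print(emb, lstm, dense, total)
```

The four numbers match the summary: 320000, 53200, 101 and a total of 373,301.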


## CNN + LSTM

In [None]:
# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
filepath="weights_best_cnn.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max',save_weights_only=True)
callbacks_list = [checkpoint]
model.fit(X_train, y_train, epochs=5, batch_size=256,verbose = 1,callbacks = callbacks_list,validation_data=(X_cv,y_cv))

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 600, 32)           320000    
                                                                 
 conv1d (Conv1D)             (None, 600, 32)           3104      
                                                                 
 max_pooling1d (MaxPooling1D  (None, 300, 32)          0         
 )                                                               
                                                                 
 lstm_2 (LSTM)               (None, 100)               53200     
                                                                 
 dense_2 (Dense)             (None, 1)                 101       
                                                                 
Total params: 376,405
Trainable params: 376,405
Non-trainable params: 0
________________________________________________

<keras.callbacks.History at 0x79420f87a7d0>

In [None]:
# Final evaluation of the model
# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.load_weights("weights_best_cnn.hdf5")
scores = model.evaluate(X_test, y_test, verbose=0)
accuracy_cnn_lstm = scores[1]*100
print("Accuracy CNN using LSTM: %.2f%%" % (accuracy_cnn_lstm))

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 600, 32)           320000    
                                                                 
 conv1d_1 (Conv1D)           (None, 600, 32)           3104      
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 300, 32)          0         
 1D)                                                             
                                                                 
 lstm_3 (LSTM)               (None, 100)               53200     
                                                                 
 dense_3 (Dense)             (None, 1)                 101       
                                                                 
Total params: 376,405
Trainable params: 376,405
Non-trainable params: 0
________________________________________________
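
The extra layers in the CNN + LSTM summary can be checked the same way. This sketch computes the Conv1D parameter count and the sequence length after MaxPooling1D (assuming the default stride equal to `pool_size` and no padding in the pooling layer); the helper names are illustrative, not Keras functions.

```python
def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of shape (kernel_size, in_channels) per filter, plus one bias each
    return kernel_size * in_channels * filters + filters


def pooled_length(length, pool_size):
    # MaxPooling1D with stride == pool_size and 'valid' padding
    return (length - pool_size) // pool_size + 1


conv = conv1d_params(in_channels=32, filters=32, kernel_size=3)
steps_after_pool = pooled_length(600, pool_size=2)
total_cnn_lstm = 320000 + conv + 53200 + 101  # embedding + conv + LSTM + dense

print(conv, steps_after_pool, total_cnn_lstm)
```

The LSTM therefore receives 300 time steps instead of 600, while the model only gains 3,104 parameters (376,405 in total).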

## Results

In [None]:
table = PrettyTable()
table.field_names = ['Model', 'Accuracy']
# Each row pairs the model name with its own accuracy variable
table.add_row(['LSTM', accuracy_lstm])
table.add_row(['CNN using LSTM', accuracy_cnn_lstm])
print(table)

+----------------+-------------------+
|     Model      |      Accuracy     |
+----------------+-------------------+
|      LSTM      | 87.30400204658508 |
| CNN using LSTM | 87.32399940490723 |
+----------------+-------------------+


The CNN+LSTM model trained considerably faster than the LSTM-only model, although the IMDB cells above do not log training time.

The speed-up comes mainly from the layers placed in front of the LSTM. The Conv1D layer captures local patterns (windows of 3 tokens) and the MaxPooling1D layer halves the sequence from 600 to 300 time steps, so the LSTM has half as many steps to process. In addition, the LSTM-only model uses recurrent_dropout=0.2, which in TensorFlow's Keras keeps the layer from using the fast cuDNN kernel when running on a GPU and likely widens the gap.

In terms of accuracy the two models are practically tied, both at about 87.3% on the test set. Since CNN+LSTM reaches comparable accuracy in a fraction of the training time needed by the LSTM, it is the more efficient approach in this particular case.
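
Since the IMDB training cells do not record how long `model.fit` takes, one way to back the speed comparison with numbers is to wrap each training call in a small timer. The sketch below is only a suggestion: `timed` and `training_times` are names introduced here, and the workload inside the `with` block is a stand-in for the real `model.fit(...)` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


training_times = {}

# Stand-in workload; in the notebook this would wrap the model.fit(...) call
with timed("demo", training_times):
    sum(i * i for i in range(100_000))

print(f"demo took {training_times['demo']:.3f} s")
```

Using one label per model (for example "lstm" and "cnn_lstm") would leave both timings in the same dictionary, ready to be added to the results table.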

# Using the BIG_AHR dataset
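
The cells in this section reuse the names `X_train`, `y_train`, `X_cv`, `y_cv`, `X_test` and `y_test`, but the step that builds them from Big_AHR.csv is not shown. The sketch below outlines one possible way to turn the reviews into integer sequences using only the standard library. It rests on several assumptions that must be checked against the real file: the column names `review_text` and `label`, and the convention that label 0 is negative and 1 is positive (any other label is dropped). The vocabulary scheme (0 = padding, 1 = unknown word) is a simplified stand-in for a Keras tokenizer, not the method used in the original run.

```python
import csv
import io
import re
from collections import Counter

TEXT_COL = "review_text"   # assumed column name, check the actual CSV header
LABEL_COL = "label"        # assumed column name; assumed 0 = negative, 1 = positive


def tokenize(text):
    # Lowercase and keep runs of letters, including Spanish accented characters
    return re.findall(r"[a-záéíóúüñ]+", text.lower())


def load_binary_reviews(csv_file):
    """Return (texts, labels), keeping only rows whose label is 0 or 1."""
    texts, labels = [], []
    for row in csv.DictReader(csv_file):
        if row[LABEL_COL] in ("0", "1"):
            texts.append(row[TEXT_COL])
            labels.append(int(row[LABEL_COL]))
    return texts, labels


def build_vocab(texts, top_words):
    """Map the most frequent tokens to ids 2..top_words-1 (0 = padding, 1 = unknown)."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    most_common = counts.most_common(top_words - 2)
    return {tok: idx + 2 for idx, (tok, _) in enumerate(most_common)}


def encode(text, vocab):
    # Words outside the vocabulary are mapped to id 1
    return [vocab.get(tok, 1) for tok in tokenize(text)]


# Tiny in-memory stand-in for the real (unzipped) CSV file
sample = io.StringIO(
    "review_text,label\n"
    "El hotel es excelente,1\n"
    "Habitación sucia y ruidosa,0\n"
    "Normal,3\n"
)
texts, labels = load_binary_reviews(sample)
vocab = build_vocab(texts, top_words=10000)
sequences = [encode(text, vocab) for text in texts]
print(labels, sequences)
```

For the real data, the archive would first be unzipped and the file opened with `open(path, newline="", encoding="utf-8")`. The resulting sequences can then be padded with `sequence.pad_sequences` and split with `train_test_split` in the same way as the IMDB data.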

## LSTM

In [None]:
import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Define hyperparameters
embedding_vector_length = 32
lstm_units = 100
dropout_rate = 0.2
num_epochs = 5
batch_size = 256

# Create the model
model = Sequential()
model.add(Embedding(input_dim=top_words,
                    output_dim=embedding_vector_length,
                    input_length=max_review_length,
                    name="embedding_layer"))
model.add(LSTM(units=lstm_units,
               dropout=dropout_rate,
               recurrent_dropout=dropout_rate,
               name="lstm_layer"))
model.add(Dense(units=1, activation='sigmoid', name="output_layer"))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print model summary
print(model.summary())

# Specify the path for model weights
weights_filepath="weights_best.hdf5"

# Set callbacks
checkpoint = ModelCheckpoint(weights_filepath,
                             monitor='val_accuracy',
                             verbose=1,
                             save_best_only=True,
                             mode='max',
                             save_weights_only=True)
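# Note: with patience=5 and num_epochs=5 this callback cannot stop training early;
# a smaller patience is needed for it to have any effect.
early_stop = EarlyStopping(monitor='val_accuracy',
                           patience=5,
                           restore_best_weights=True)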

callbacks_list = [checkpoint, early_stop]

# Record the start time
start_time = time.time()

# Fit the model
model.fit(X_train,
          y_train,
          epochs=num_epochs,
          batch_size=batch_size,
          verbose=1,
          callbacks=callbacks_list,
          validation_data=(X_cv, y_cv))

# Record the end time
end_time = time.time()

# Calculate and print the time taken to train the model
training_time = end_time - start_time
print(f'The model took {training_time} seconds to train.')


Model: "sequential_14"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_layer (Embedding)  (None, 600, 32)          320000    
                                                                 
 lstm_layer (LSTM)           (None, 100)               53200     
                                                                 
 output_layer (Dense)        (None, 1)                 101       
                                                                 
Total params: 373,301
Trainable params: 373,301
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/5
Epoch 1: val_accuracy improved from -inf to 0.71768, saving model to weights_best.hdf5
Epoch 2/5
Epoch 2: val_accuracy did not improve from 0.71768
Epoch 3/5
Epoch 3: val_accuracy did not improve from 0.71768
Epoch 4/5
Epoch 4: val_accuracy did not improve from 0.71768
Epoch 5/5
Epoch 5: val_acc
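
The log above shows validation accuracy stalling after the first epoch, yet training still runs all 5 epochs. The following simplified model of patience-based early stopping (not the Keras implementation itself) illustrates why: with `patience=5` and only 5 epochs, the stop condition can never be reached. The function name and the accuracy values after the first one are illustrative assumptions; only 0.71768 comes from the log.

```python
def epochs_until_stop(val_accuracies, patience):
    """Return how many epochs run before a patience-based stop (simplified model)."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, wait = acc, 0      # improvement resets the counter
        else:
            wait += 1                # one more epoch without improvement
            if wait >= patience:
                return epoch
    return len(val_accuracies)


# Illustrative values: only the first number comes from the log above
history = [0.71768, 0.7100, 0.7100, 0.7100, 0.7100]

print(epochs_until_stop(history, patience=5))  # runs every epoch
print(epochs_until_stop(history, patience=2))  # stops after the third epoch
```

Under this model, a patience of 2 would have ended the run two epochs earlier without changing the best checkpoint.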

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Define hyperparameters
embedding_vector_length = 32
lstm_units = 100
dropout_rate = 0.2
batch_size = 256

# Load model architecture
model = Sequential()
model.add(Embedding(input_dim=top_words,
                    output_dim=embedding_vector_length,
                    input_length=max_review_length,
                    name="embedding_layer"))
model.add(LSTM(units=lstm_units,
               dropout=dropout_rate,
               recurrent_dropout=dropout_rate,
               name="lstm_layer"))
model.add(Dense(units=1, activation='sigmoid', name="output_layer"))

# Load the best weights
weights_filepath = "weights_best.hdf5"
model.load_weights(weights_filepath)

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Evaluate the model
scores = model.evaluate(X_test, y_test, verbose=1, batch_size=batch_size)

# Calculate and print accuracy
accuracy_lstm = scores[1] * 100
print(f"Accuracy using LSTM: {accuracy_lstm:.2f}%")


Accuracy using LSTM: 72.07%


## CNN + LSTM

In [None]:
import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense
from tensorflow.keras.callbacks import ModelCheckpoint

# Define hyperparameters
embedding_vector_length = 32
conv1d_filters = 32
conv1d_kernel_size = 3
pool_size = 2
lstm_units = 100
num_epochs = 5
batch_size = 256

# Create the model
model = Sequential()
model.add(Embedding(input_dim=top_words,
                    output_dim=embedding_vector_length,
                    input_length=max_review_length,
                    name="embedding_layer"))
model.add(Conv1D(filters=conv1d_filters,
                 kernel_size=conv1d_kernel_size,
                 padding='same',
                 activation='relu',
                 name="conv1d_layer"))
model.add(MaxPooling1D(pool_size=pool_size, name="maxpooling1d_layer"))
model.add(LSTM(units=lstm_units, name="lstm_layer"))
model.add(Dense(units=1, activation='sigmoid', name="output_layer"))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print model summary
print(model.summary())

# Specify the path for model weights
weights_filepath="weights_best_cnn.hdf5"

# Set callbacks
checkpoint = ModelCheckpoint(weights_filepath,
                             monitor='val_accuracy',
                             verbose=1,
                             save_best_only=True,
                             mode='max',
                             save_weights_only=True)

callbacks_list = [checkpoint]

# Record the start time
start_time = time.time()

# Fit the model
model.fit(X_train,
          y_train,
          epochs=num_epochs,
          batch_size=batch_size,
          verbose=1,
          callbacks=callbacks_list,
          validation_data=(X_cv, y_cv))

# Record the end time
end_time = time.time()

# Calculate and print the time taken to train the model
training_time = end_time - start_time
print(f'The model took {training_time} seconds to train.')


Model: "sequential_13"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_layer (Embedding)  (None, 600, 32)          320000    
                                                                 
 conv1d_layer (Conv1D)       (None, 600, 32)           3104      
                                                                 
 maxpooling1d_layer (MaxPool  (None, 300, 32)          0         
 ing1D)                                                          
                                                                 
 lstm_layer (LSTM)           (None, 100)               53200     
                                                                 
 output_layer (Dense)        (None, 1)                 101       
                                                                 
Total params: 376,405
Trainable params: 376,405
Non-trainable params: 0
_______________________________________________

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

# Define hyperparameters
embedding_vector_length = 32
conv1d_filters = 32
conv1d_kernel_size = 3
pool_size = 2
lstm_units = 100
batch_size = 256

# Load model architecture
model = Sequential()
model.add(Embedding(input_dim=top_words,
                    output_dim=embedding_vector_length,
                    input_length=max_review_length,
                    name="embedding_layer"))
model.add(Conv1D(filters=conv1d_filters,
                 kernel_size=conv1d_kernel_size,
                 padding='same',
                 activation='relu',
                 name="conv1d_layer"))
model.add(MaxPooling1D(pool_size=pool_size, name="maxpooling1d_layer"))
model.add(LSTM(units=lstm_units, name="lstm_layer"))
model.add(Dense(units=1, activation='sigmoid', name="output_layer"))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print model summary
print(model.summary())

# Load the best weights
weights_filepath = "weights_best_cnn.hdf5"
model.load_weights(weights_filepath)

# Evaluate the model
scores = model.evaluate(X_test, y_test, verbose=0)

# Calculate and print accuracy
accuracy_cnn_lstm = scores[1] * 100
print(f"Accuracy CNN using LSTM: {accuracy_cnn_lstm:.2f}%")


Model: "sequential_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_layer (Embedding)  (None, 600, 32)          320000    
                                                                 
 conv1d_layer (Conv1D)       (None, 600, 32)           3104      
                                                                 
 maxpooling1d_layer (MaxPool  (None, 300, 32)          0         
 ing1D)                                                          
                                                                 
 lstm_layer (LSTM)           (None, 100)               53200     
                                                                 
 output_layer (Dense)        (None, 1)                 101       
                                                                 
Total params: 376,405
Trainable params: 376,405
Non-trainable params: 0
_______________________________________________

## Results

In [None]:
table = PrettyTable()
table.field_names = ['Model', 'Accuracy']
table.add_row(['LSTM', accuracy_lstm])
table.add_row(['CNN using LSTM', accuracy_cnn_lstm])
print(table)

+----------------+-------------------+
|     Model      |      Accuracy     |
+----------------+-------------------+
|      LSTM      | 72.07175493240356 |
| CNN using LSTM | 72.07175493240356 |
+----------------+-------------------+


The CNN+LSTM model again trained much faster than the LSTM-only model: about 264 seconds versus about 886 seconds (both times were measured while training the models; see the corresponding cells).

As with the IMDB data, the speed-up comes mainly from the layers placed in front of the LSTM. The Conv1D layer captures local patterns and the MaxPooling1D layer halves the sequence from 600 to 300 time steps, so the LSTM has half as many steps to process.

In terms of accuracy, both models report exactly the same test accuracy, 72.07175493240356%, and the LSTM's validation accuracy did not improve after the first epoch. Scores that are identical to the last digit suggest that both models may be predicting the same class for every review (for example, the majority class), so this result should be verified before concluding that the models are equally accurate. With that caveat, CNN+LSTM reaches the same reported accuracy in roughly 30% of the training time, which makes it the more efficient approach in this case, just as in the first example with the English dataset.
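
When two different models report exactly the same accuracy, a useful check is to compare that figure with the accuracy of always predicting the most frequent class, and to count how many predictions fall on each side of the 0.5 threshold. The sketch below shows both checks on toy data; the function names and the toy values are assumptions for illustration, standing in for `y_test` and the output of `model.predict(X_test)`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy (in %) of always predicting the most frequent label."""
    counts = Counter(labels)
    return 100.0 * max(counts.values()) / len(labels)


def predicted_class_counts(probabilities, threshold=0.5):
    """Count how many sigmoid outputs fall on each side of the threshold."""
    return Counter(int(p > threshold) for p in probabilities)


# Toy data standing in for y_test and model.predict(X_test)
toy_labels = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
toy_probs = [0.81, 0.77, 0.90, 0.66, 0.72, 0.58, 0.93, 0.88, 0.61, 0.79]

print(majority_baseline_accuracy(toy_labels))
print(predicted_class_counts(toy_probs))
```

If the baseline on the real test labels equals the reported 72.07% and every prediction lands in a single class, the models have not learned to separate the classes, and steps such as class weighting or a balanced subset would be worth trying.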


