<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="500" align="center">


# Procesamiento de lenguaje natural
## Sentiment analysis con Embeddings + LSTM

### Objetivo
El objetivo es utilizar las críticas de compradores de ropa para que el sistema determine la evaluación del comprador y su crítica (cuantas estrellas le asigna al producto).

In [1]:
!pip install --upgrade --no-cache-dir gdown --quiet

In [2]:
import numpy as np
import random
import io
import pickle
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from keras.utils.np_utils import to_categorical
from tensorflow.keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

### Datos
Utilizaremos como dataset críticas de compradores de ropa (eCommerce) los cuales puntuaron a cada prenda con un puntaje de 1 a 5 estrellas.\
Referencia del dataset: [LINK](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews/version/1)

In [3]:
# Descargar la carpeta de dataset
import os
import gdown
if os.access('clothing_ecommerce_reviews.csv', os.F_OK) is False:
    url = 'https://drive.google.com/uc?id=1Urn1UFSrodN5BuW6-sc_igtaySGRwhV8'
    output = 'clothing_ecommerce_reviews.csv'
    gdown.download(url, output, quiet=False)
else:
    print("El dataset ya se encuentra descargado")

Downloading...
From: https://drive.google.com/uc?id=1Urn1UFSrodN5BuW6-sc_igtaySGRwhV8
To: /content/clothing_ecommerce_reviews.csv
100%|██████████| 8.48M/8.48M [00:00<00:00, 23.5MB/s]


In [4]:
# Armar el dataset
df = pd.read_csv('clothing_ecommerce_reviews.csv')
df.drop(columns = ['Unnamed: 0'], inplace = True)
df.head()

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


### 1 - Limpieza de datos
Alumno:
- Del dataset unicamente utilizar las columnas "Review Text" y "Rating.
- Tranformar el rating 1-5 a una escala numérica de 0 a 4.



In [5]:
df_reviews = df.loc[:, ['Review Text', 'Rating']].dropna()
df_reviews['Rating'] = df_reviews['Rating'] - 1
df_reviews.head()

Unnamed: 0,Review Text,Rating
0,Absolutely wonderful - silky and sexy and comf...,3
1,Love this dress! it's sooo pretty. i happene...,4
2,I had such high hopes for this dress and reall...,2
3,"I love, love, love this jumpsuit. it's fun, fl...",4
4,This shirt is very flattering to all due to th...,4


In [78]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22641 entries, 0 to 23485
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Review Text  22641 non-null  object
 1   Rating       22641 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.0+ MB


In [9]:
len(df_reviews)

22641

In [11]:
df_reviews['Rating'].value_counts()

4    12540
3     4908
2     2823
1     1549
0      821
Name: Rating, dtype: int64

In [10]:
df_reviews['Rating'].value_counts()/len(df_reviews)

4    0.553862
3    0.216775
2    0.124685
1    0.068416
0    0.036262
Name: Rating, dtype: float64

In [12]:
# Alumno: tomar la columna de las review y almacenarlo todo en un vector numpy de reviews
type(df_reviews['Review Text'].values)

numpy.ndarray

In [6]:
reviews = df_reviews['Review Text'].values

In [9]:
# Alumno: Cuantas reviews (rows) hay para evaluar?
len(reviews)

22641

In [7]:
# Alumno: Concatenar todas las reviews para armar el corpus

corpus = ' '.join(reviews)

In [11]:
# Alumno: ¿Cuál es la longitud de ese corpus?
len(corpus)

7011643

In [8]:
# Alumno: Utilizar "text_to_word_sequence" para separar las palabras en tokens
# recordar que text_to_word_sequence automaticamente quita los signos de puntuacion y pasa el texto a lowercase
from keras.preprocessing.text import text_to_word_sequence

tokens=text_to_word_sequence(corpus)

In [13]:
# Alumno: Dar un vistazo a los primeros 20 tokens/palabras
tokens[:20]

['absolutely',
 'wonderful',
 'silky',
 'and',
 'sexy',
 'and',
 'comfortable',
 'love',
 'this',
 'dress',
 "it's",
 'sooo',
 'pretty',
 'i',
 'happened',
 'to',
 'find',
 'it',
 'in',
 'a']

In [14]:
# Alumno: ¿Cuántos tokens/palabras hay?
len(tokens)

1372203

In [9]:
# Alumno: Tokenizar las palabras con el Tokenizer de Keras
# Definir una máxima cantidad de palabras a utilizar:
# num_words --> the maximum number of words to keep, based on word frequency.
# Only the most common num_words-1 words will be kept.
from keras.preprocessing.text import Tokenizer
num_words = 2000
vocab_size = num_words

tokenizer = Tokenizer(num_words=2000) 
tokenizer.fit_on_texts(tokens) 


In [10]:
# Alumno: Obtener el diccionario de palabra (word) a índice
# y observar la cantidad total del vocabulario

word_index = tokenizer.word_index
len(word_index)

14847

In [11]:
# Alumno: Convertir las palabras/tokens a números
tokens_nums = tokenizer.texts_to_sequences(reviews)

In [18]:
tokens_nums[0]

[253, 532, 917, 3, 662, 3, 68]

In [12]:
# Alumno: Determinar cual es la oración más larga
max(len(s) for s in tokens_nums)

115

In [78]:
# Alumno: Realizar padding de las sentencias al mismo tamaño
# tomando de referencia la máxima sentencia
from tensorflow.keras.utils import pad_sequences
maxlen = 115

X = pad_sequences(tokens_nums, padding='pre', maxlen=maxlen)


In [22]:
# Alumno: Observar las dimensiones de la variable input
X.shape

(22641, 115)

In [13]:
def one_hot_encode(arr):
    n_values = np.max(arr) + 1
    return np.eye(n_values)[arr]

In [87]:
df_reviews['Rating'].head(5)

0    3
1    4
2    2
3    4
4    4
Name: Rating, dtype: int64

In [88]:
one_hot_encode(df_reviews['Rating'].head(5))

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.]])

In [81]:
# Alumno tomar la columna rating y alcemacenarla en una variable "y" transformada a oneHotEncoding
# Su shape debe ser equivalente la cantidad de rows del corpus y a la cantidad
# de clases que se deseen predecir (en este ejemplo son 5)

y  = one_hot_encode(df_reviews['Rating'])

In [90]:
y.shape

(22641, 5)

In [15]:
# Alumno: Dividir los datos en train y test
from sklearn.model_selection import train_test_split

Para reducir el desbalanceo voy a achicar mi conjunto de entrenamiento eliminando de manera random instancias de las clases mas pobladas

In [138]:
type(X)

numpy.ndarray

In [140]:
type(df_reviews['Rating'])

pandas.core.series.Series

In [79]:
df_nuevo  = pd.DataFrame(X)

In [82]:
ys = [f'y_{i}' for i in range(5)]
for i, col in enumerate(ys):
    df_nuevo[col] = y[:, i]

In [145]:
df_nuevo.shape

(22641, 120)

In [83]:
df_nuevo['Rating'] = df_reviews['Rating']

In [158]:
for i in range(5):
  print(len(df_nuevo[df_nuevo['Rating']== i ]['Rating']) /df_nuevo.shape[0])

0.035201625369904156
0.06550064043107637
0.11991519809195707
0.20802968066781502
0.5350470385583675


In [91]:
cantidad_4 = int(cantidad_4 / 5)
cantidad_4

2422

In [92]:
sample_to_remove = df_nuevo[df_nuevo['Rating'] == 4].sample(n=len(df_nuevo[df_nuevo['Rating']== 4 ]['Rating']) - cantidad_4, random_state=42)

In [93]:
df_nuevo = df_nuevo.drop(sample_to_remove.index)

In [94]:
df_nuevo.shape[0]

12949

In [95]:
for i in range(5):
  print(len(df_nuevo[df_nuevo['Rating']== i ]['Rating']) /df_nuevo.shape[0])

0.061549154374855204
0.11452621824079079
0.20966870028573634
0.3637346513244266
0.1870414703838134


In [98]:
cantidad_3 = len(df_nuevo[df_nuevo['Rating']== 3 ]['Rating'])
quinto_de_3 = int(cantidad_3 * 0.3)
sample_to_remove = df_nuevo[df_nuevo['Rating'] == 3].sample(n=quinto_de_3, random_state=42)
df_nuevo = df_nuevo.drop(sample_to_remove.index)
df_nuevo.shape[0]

10877

In [99]:
for i in range(5):
  print(len(df_nuevo[df_nuevo['Rating']== i ]['Rating']) /df_nuevo.shape[0])

0.07327388066562471
0.1363427415647697
0.24960926726119334
0.24253010940516687
0.2226716925622874


Emparejé bastante el desbalanceo

In [168]:
df_nuevo.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,111,112,113,114,y_0,y_1,y_2,y_3,y_4,Rating
0,0,0,0,0,0,0,0,0,0,0,...,3,662,3,68,0.0,0.0,0.0,1.0,0.0,3.0
2,0,0,0,0,0,0,0,0,0,0,...,1,469,5,688,0.0,0.0,1.0,0.0,0.0,2.0
3,0,0,0,0,0,0,0,0,0,0,...,533,10,34,210,0.0,0.0,0.0,0.0,1.0,4.0
5,0,0,0,0,0,0,0,0,0,0,...,2,358,7,18,0.0,1.0,0.0,0.0,0.0,1.0
7,0,0,0,0,0,0,0,0,0,0,...,9,1354,1689,22,0.0,0.0,0.0,1.0,0.0,3.0


In [100]:
X = df_nuevo.iloc[:, :115]
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,105,106,107,108,109,110,111,112,113,114
2,0,0,0,0,0,0,0,0,0,0,...,15,1,151,475,830,313,1,469,5,688
3,0,0,0,0,0,0,0,0,0,0,...,183,2,32,5,2,115,533,10,34,210
5,0,0,0,0,0,0,0,0,0,0,...,39,131,19,102,11,31,2,358,7,18
7,0,0,0,0,0,0,0,0,0,0,...,187,47,6,455,450,62,9,1354,1689,22
8,0,0,0,0,0,0,0,0,0,0,...,58,3,385,14,1,236,103,1323,12,133
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22632,0,0,0,0,0,0,0,0,0,0,...,205,290,43,49,40,216,8,24,50,840
22634,0,0,0,0,0,0,0,0,0,0,...,764,3,790,1,83,34,2,65,8,261
22635,0,0,0,0,0,0,0,0,0,0,...,4,335,510,71,4,148,948,26,10,1
22636,0,0,0,0,0,0,0,0,0,0,...,11,3,89,4,23,58,120,3,47,1156


In [101]:
y = df_nuevo.iloc[:, 115:120]
y

Unnamed: 0,y_0,y_1,y_2,y_3,y_4
2,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0
5,0.0,1.0,0.0,0.0,0.0
7,0.0,0.0,0.0,1.0,0.0
8,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...
22632,0.0,0.0,0.0,1.0,0.0
22634,1.0,0.0,0.0,0.0,0.0
22635,0.0,0.0,0.0,0.0,1.0
22636,0.0,0.0,0.0,0.0,1.0


Ahora separo en train y test

In [102]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=42 )

### 2 - Entrenar el modelo con Embeddings + LSTM

In [27]:
# Alumno: Entrene su modelo con LSTM entrenando sus propios embeddings
# o utilizando embeddings pre-entrenados.
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import Dropout

In [28]:
import os
import gdown
if os.access('fasttext.pkl', os.F_OK) is False:
    url = 'https://drive.google.com/uc?id=1KU5qmAYh3LATMvVgocFDfW-PK3prm1WU&export=download'
    output = 'fasttext.pkl'
    gdown.download(url, output, quiet=False)
else:
    print("Los embeddings fasttext.pkl ya están descargados")

Downloading...
From (uriginal): https://drive.google.com/uc?id=1KU5qmAYh3LATMvVgocFDfW-PK3prm1WU&export=download
From (redirected): https://drive.google.com/uc?id=1KU5qmAYh3LATMvVgocFDfW-PK3prm1WU&export=download&confirm=t&uuid=960fc076-325a-46cc-af4e-8498530c83f6
To: /content/fasttext.pkl
100%|██████████| 2.88G/2.88G [00:12<00:00, 222MB/s]


In [29]:
import logging
import os
from pathlib import Path
from io import StringIO
import pickle

class WordsEmbeddings(object):
    logger = logging.getLogger(__name__)

    def __init__(self):
        # load the embeddings
        words_embedding_pkl = Path(self.PKL_PATH)
        if not words_embedding_pkl.is_file():
            words_embedding_txt = Path(self.WORD_TO_VEC_MODEL_TXT_PATH)
            assert words_embedding_txt.is_file(), 'Words embedding not available'
            embeddings = self.convert_model_to_pickle()
        else:
            embeddings = self.load_model_from_pickle()
        self.embeddings = embeddings
        # build the vocabulary hashmap
        index = np.arange(self.embeddings.shape[0])
        # Dicctionarios para traducir de embedding a IDX de la palabra
        self.word2idx = dict(zip(self.embeddings['word'], index))
        self.idx2word = dict(zip(index, self.embeddings['word']))

    def get_words_embeddings(self, words):
        words_idxs = self.words2idxs(words)
        return self.embeddings[words_idxs]['embedding']

    def words2idxs(self, words):
        return np.array([self.word2idx.get(word, -1) for word in words])

    def idxs2words(self, idxs):
        return np.array([self.idx2word.get(idx, '-1') for idx in idxs])

    def load_model_from_pickle(self):
        self.logger.debug(
            'loading words embeddings from pickle {}'.format(
                self.PKL_PATH
            )
        )
        max_bytes = 2**28 - 1 # 256MB
        bytes_in = bytearray(0)
        input_size = os.path.getsize(self.PKL_PATH)
        with open(self.PKL_PATH, 'rb') as f_in:
            for _ in range(0, input_size, max_bytes):
                bytes_in += f_in.read(max_bytes)
        embeddings = pickle.loads(bytes_in)
        self.logger.debug('words embeddings loaded')
        return embeddings

    def convert_model_to_pickle(self):
        # create a numpy strctured array:
        # word     embedding
        # U50      np.float32[]
        # word_1   a, b, c
        # word_2   d, e, f
        # ...
        # word_n   g, h, i
        self.logger.debug(
            'converting and loading words embeddings from text file {}'.format(
                self.WORD_TO_VEC_MODEL_TXT_PATH
            )
        )
        structure = [('word', np.dtype('U' + str(self.WORD_MAX_SIZE))),
                     ('embedding', np.float32, (self.N_FEATURES,))]
        structure = np.dtype(structure)
        # load numpy array from disk using a generator
        with open(self.WORD_TO_VEC_MODEL_TXT_PATH, encoding="utf8") as words_embeddings_txt:
            embeddings_gen = (
                (line.split()[0], line.split()[1:]) for line in words_embeddings_txt
                if len(line.split()[1:]) == self.N_FEATURES
            )
            embeddings = np.fromiter(embeddings_gen, structure)
        # add a null embedding
        null_embedding = np.array(
            [('null_embedding', np.zeros((self.N_FEATURES,), dtype=np.float32))],
            dtype=structure
        )
        embeddings = np.concatenate([embeddings, null_embedding])
        # dump numpy array to disk using pickle
        max_bytes = 2**28 - 1 # # 256MB
        bytes_out = pickle.dumps(embeddings, protocol=pickle.HIGHEST_PROTOCOL)
        with open(self.PKL_PATH, 'wb') as f_out:
            for idx in range(0, len(bytes_out), max_bytes):
                f_out.write(bytes_out[idx:idx+max_bytes])
        self.logger.debug('words embeddings loaded')
        return embeddings


class GloveEmbeddings(WordsEmbeddings):
    WORD_TO_VEC_MODEL_TXT_PATH = 'glove.twitter.27B.50d.txt'
    PKL_PATH = 'gloveembedding.pkl'
    N_FEATURES = 50
    WORD_MAX_SIZE = 60


class FasttextEmbeddings(WordsEmbeddings):
    WORD_TO_VEC_MODEL_TXT_PATH = 'cc.en.300.vec'
    PKL_PATH = 'fasttext.pkl'
    N_FEATURES = 300
    WORD_MAX_SIZE = 60

In [30]:
model_fasttext = FasttextEmbeddings()

In [31]:
# Crear la Embedding matrix

print('preparing embedding matrix...')
embed_dim = 300 # fasttext
words_not_found = []

# word_index provieen del tokenizer

nb_words = min(num_words, len(corpus)) # vocab_size
embedding_matrix = np.zeros((nb_words, embed_dim))
for word, i in word_index.items():
    if i >= nb_words:
        continue
    embedding_vector = model_fasttext.get_words_embeddings(word)[0]
    if (embedding_vector is not None) and len(embedding_vector) > 0:
        embedding_matrix[i] = embedding_vector
    else:
        # words not found in embedding index will be all-zeros.
        words_not_found.append(word)
        print(word)

print('number of null word embeddings:', np.sum(np.sum(embedding_matrix, axis=1) == 0))

preparing embedding matrix...
number of null word embeddings: 2


In [180]:
words_not_found

[]

In [32]:
model = Sequential()

model.add(Embedding(input_dim = vocab_size,
                    output_dim = embed_dim,
                    input_length = 115,
                    weights = [embedding_matrix],
                    trainable =False))

model.add(LSTM(64, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(64, return_sequences = False))
model.add(Dropout(0.2))
model.add(Dense(32, activation = 'relu'))
model.add(Dropout(0.2))
model.add(Dense(5, activation = 'softmax'))

model.compile(optimizer ="adam", 
              loss = "binary_crossentropy", 
              metrics=['accuracy', ]) 




In [33]:
hist = model.fit(X_train, y_train, epochs = 7, validation_split =0.2)

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7


In [60]:
X_test[:1].shape

(1, 115)

In [63]:
r = np.zeros(5) # a.size if a is a numpy array
r[np.argmax(model.predict(X_test[:1]))]=1
r



array([0., 0., 0., 0., 1.])

In [64]:
y_test[:1]

Unnamed: 0,y_0,y_1,y_2,y_3,y_4
12760,0.0,0.0,0.0,1.0,0.0


In [39]:
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import confusion_matrix

In [34]:
y_prediction = model.predict(X_test)




In [66]:
y_prediction.shape

(4496, 5)

In [36]:
binary_predictions = np.zeros_like(y_prediction)

In [68]:
binary_predictions.shape

(4496, 5)

In [37]:
for i in range(4496):
  binary_predictions[i][np.argmax(y_prediction[i])] = 1

In [81]:
todos = True
for i in range(4496):
  todos = todos * (sum(binary_predictions[i] == 1))
todos

1

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import itertools


In [40]:
#Create confusion matrix and normalizes it over predicted (columns)
result = multilabel_confusion_matrix(y_test, binary_predictions)
result

array([[[4321,    0],
        [ 175,    0]],

       [[4211,    0],
        [ 285,    0]],

       [[3925,    0],
        [ 571,    0]],

       [[3512,    0],
        [ 984,    0]],

       [[   0, 2015],
        [   0, 2481]]])

In [41]:
result.shape

(5, 2, 2)

In [42]:
model = Sequential()

model.add(Embedding(input_dim = vocab_size,
                    output_dim = embed_dim,
                    input_length = 115,
                    weights = [embedding_matrix],
                    trainable =False))

model.add(LSTM(64, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(128, return_sequences = False))
model.add(Dropout(0.2))
model.add(Dense(32, activation = 'relu'))
model.add(Dropout(0.2))
model.add(Dense(5, activation = 'softmax'))

model.compile(optimizer ="adam", 
              loss = "binary_crossentropy", 
              metrics=['accuracy' ]) 




In [43]:
hist = model.fit(X_train, y_train, epochs = 7, validation_split =0.2)

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7


In [44]:
y_prediction = model.predict(X_test)




In [45]:
binary_predictions = np.zeros_like(y_prediction)

In [46]:
for i in range(4496):
  binary_predictions[i][np.argmax(y_prediction[i])] = 1

In [47]:
#Create confusion matrix and normalizes it over predicted (columns)
result = multilabel_confusion_matrix(y_test, binary_predictions)
result

array([[[4321,    0],
        [ 175,    0]],

       [[4211,    0],
        [ 285,    0]],

       [[3917,    8],
        [ 561,   10]],

       [[3512,    0],
        [ 984,    0]],

       [[  17, 1998],
        [   1, 2480]]])

In [69]:
model = Sequential()

model.add(Embedding(input_dim = vocab_size,
                    output_dim = embed_dim,
                    input_length = 115,
                    weights = [embedding_matrix],
                    trainable =False))

model.add(LSTM(128, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(64, return_sequences = False))
model.add(Dropout(0.2))
model.add(Dense(32, activation = 'relu'))
model.add(Dropout(0.2))
model.add(Dense(5, activation = 'softmax'))

model.compile(optimizer ="adam", 
              loss = "binary_crossentropy", 
              metrics=['accuracy']) 




In [70]:
hist = model.fit(X_train, y_train, epochs = 15, validation_split =0.2)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [71]:
y_prediction = model.predict(X_test)




In [72]:
binary_predictions = np.zeros_like(y_prediction)

In [73]:
for i in range(4496):
  binary_predictions[i][np.argmax(y_prediction[i])] = 1

In [74]:
#Create confusion matrix and normalizes it over predicted (columns)
result = multilabel_confusion_matrix(y_test, binary_predictions)
result

array([[[4321,    0],
        [ 175,    0]],

       [[4211,    0],
        [ 285,    0]],

       [[3604,  321],
        [ 418,  153]],

       [[3467,   45],
        [ 975,    9]],

       [[ 413, 1602],
        [ 115, 2366]]])

balancee el dataset un poco mas

In [118]:
model = Sequential()

model.add(Embedding(input_dim = vocab_size,
                    output_dim = embed_dim,
                    input_length = 115,
                    weights = [embedding_matrix],
                    trainable =False))

model.add(LSTM(64, return_sequences = True))
model.add(Dropout(0.2))
model.add(LSTM(64, return_sequences = False))
model.add(Dropout(0.2))
#model.add(LSTM(64, return_sequences = False))
#model.add(Dropout(0.2))
model.add(Dense(128, activation = 'relu'))
model.add(Dropout(0.2))
model.add(Dense(5, activation = 'softmax'))

model.compile(optimizer ="adam", 
              loss = "binary_crossentropy", 
              metrics=['accuracy']) 




In [119]:
hist = model.fit(X_train, y_train, epochs = 15, validation_split =0.2)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


In [120]:
y_prediction = model.predict(X_test)




In [121]:
binary_predictions = np.zeros_like(y_prediction)

In [123]:
for i in range(3590):
  binary_predictions[i][np.argmax(y_prediction[i])] = 1

In [124]:
#Create confusion matrix and normalizes it over predicted (columns)
result = multilabel_confusion_matrix(y_test, binary_predictions)
result

array([[[3482,    0],
        [ 108,    0]],

       [[3346,    0],
        [ 244,    0]],

       [[2809,  349],
        [ 295,  137]],

       [[2701,   82],
        [ 771,   36]],

       [[ 459, 1132],
        [ 145, 1854]]])

Luego de haber probado distintas combinaciones entre redes y tamaños del dataset con el fin de obtener un dataset mas balanceado, no pude obtener ningún resultado considerabalmente positivo. Es necesario de mayor cantidad de datos para poder mejorar la solución a este problema de clasificación multiclases.