### Table of contents

1. [**Importing libraries**](#1.-Importing-libraries)  
2. [**Loading the data**](#2.-Loading-the-data)  
3. [**Cleaning the data**](#3.-Cleaning-the-data)  
    3.1 [**Checking for duplicates**](#3.1-Checking-for-duplicates)  
    3.2 [**Checking for empty reviews**](#3.2-Checking-for-empty-reviews)  
    3.3 [**Data distribution**](#3.3-Data-distribution)
4. [**Text preprocessing**](#4.-Text-preprocessing)  
5. [**Data normalization**](#5.-Data-normalization)  
    5.1 [**Normalizing the review scores to range from 0 to 4**](#5.1-Normalizing-the-review-scores-to-range-from-0-to-4)  
    5.2 [**Dropping unused columns**](#5.2-Dropping-unused-columns)  
    5.3 [**Dropping texts left empty after preprocessing**](#5.3-Dropping-texts-left-empty-after-preprocessing)  
    5.4 [**Splitting the data into train and test sets (80/20)**](#5.4-Splitting-the-data-into-train-and-test-sets-(80/20))  
6. [**Building the model**](#6.-Building-the-model)  
    6.1 [**Bidirectional LSTM with an Embedding layer**](#6.1-Bidirectional-LSTM-with-an-Embedding-layer)  
    6.2 [**Training the model**](#6.2-Training-the-model)  

### 1. Importing libraries

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

import re
from bs4 import BeautifulSoup
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, GRU, Flatten
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model, Sequential
from keras.layers import Convolution1D
from keras import initializers, regularizers, constraints, optimizers, layers

### 2. Loading the data

In [None]:
df = pd.read_csv("/kaggle/input/amazon-product-reviews/Reviews.csv")
print("Data size: ", df.shape)

In [None]:
df.head()

### 3. Cleaning the data

#### 3.1 Checking for duplicates

In [None]:
df = df.sort_values('ProductId', kind='quicksort', na_position='last')

In [None]:
# Keep the first review for each distinct text and drop the rest
df = df.drop_duplicates(subset=["Text"], keep='first')
df.shape
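
As a rough illustration of what `keep='first'` does after sorting, the sketch below reproduces the idea in plain Python on a few made-up rows (the product IDs and texts are invented for the example). Rows are sorted by product ID, and only the first row seen for each distinct text survives. Python's `sorted` is stable, which `kind='quicksort'` is not guaranteed to be, so the choice among rows with equal product IDs may differ in pandas.

```python
# Hypothetical rows: (ProductId, Text). Not taken from the real dataset.
rows = [
    ("B002", "Great coffee"),
    ("B001", "Great coffee"),
    ("B003", "Too sweet"),
    ("B001", "Arrived late"),
]

# Sort by product ID; sorted() is stable, so ties keep their original order.
rows_sorted = sorted(rows, key=lambda row: row[0])

seen_texts = set()
deduplicated = []
for product_id, text in rows_sorted:
    if text not in seen_texts:      # keep only the first occurrence of each text
        seen_texts.add(text)
        deduplicated.append((product_id, text))

print(deduplicated)
```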

#### 3.2 Checking for empty reviews

In [None]:
print(df['Text'].isnull().sum())
df['Score'].isnull().sum()

#### 3.3 Data distribution

In [None]:
df['Score'].value_counts()

In [None]:
plt.figure(figsize = (10,7))
sns.countplot(x=df['Score'])  # pass the column as x; recent seaborn versions no longer accept it positionally
plt.title("Score distribution")
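
If the scores turn out to be unevenly distributed, plain accuracy can look better than it is. A minimal sketch of the check, using invented scores rather than the real counts: the share of the most frequent class is the accuracy of a model that always predicts that class, which is a useful floor to compare against later.

```python
from collections import Counter

# Hypothetical scores, invented for illustration (the real counts come from value_counts()).
scores = [5, 5, 5, 4, 5, 1, 3, 5, 2, 5]

counts = Counter(scores)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(scores)

print(counts)
print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.2f}")
```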

### 4. Text preprocessing

The following operations are applied to each review:

* Remove website links
* Remove HTML tags
* Expand contractions
* Remove words containing numeric digits
* Remove non-letter characters
* Convert the text to lower case
* Remove stop words
* Apply lemmatization

In [None]:
def decontract(text):
    text = re.sub(r"won\'t", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    return text
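
The order of the substitutions matters: irregular forms such as "won't" have to be handled before the generic "n't" rule. The sketch below, with two hypothetical helper functions that each use only two of the rules, shows what happens when the order is reversed.

```python
import re

def expand_specific_first(text):
    # Specific irregular form first, then the generic "n't" rule.
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"n't", " not", text)
    return text

def expand_generic_first(text):
    # Generic rule first: "won't" is already rewritten before its own rule runs.
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"won't", "will not", text)
    return text

sample = "I won't buy it and it isn't cheap"
print(expand_specific_first(sample))  # I will not buy it and it is not cheap
print(expand_generic_first(sample))   # I wo not buy it and it is not cheap
```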

In [None]:
stop_words = set(stopwords.words('english'))
# Negation words carry sentiment, so take them out of the stop-word set and keep them in the text
negative_stop_words = set(word for word in stop_words if "n't" in word or 'no' in word)
stop_words = stop_words - negative_stop_words
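
One thing to be aware of is that `'no' in word` is a substring test, so it matches more than the negations themselves. A small sketch with a shortened, hypothetical stop-word list shows the effect and one possible stricter alternative (the `explicit_negations` set is an assumption, not part of the notebook).

```python
# Hypothetical, shortened stop-word list (the real one comes from NLTK).
sample_stop_words = {"the", "is", "no", "nor", "not", "now", "don't", "isn't", "and"}

# Same test as in the cell above: substring match on "n't" or "no".
kept_by_substring = {w for w in sample_stop_words if "n't" in w or "no" in w}
print(sorted(kept_by_substring))

# A stricter alternative: name the negations explicitly.
explicit_negations = {"no", "nor", "not"}
kept_explicit = {w for w in sample_stop_words if "n't" in w or w in explicit_negations}
print(sorted(kept_explicit))
```

With the substring test, "now" is also treated as a negation and therefore stays in the reviews. That is probably harmless, but it is not what the variable name suggests.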

In [None]:
lemmatizer = WordNetLemmatizer()

def preprocess_text(review):
    review = re.sub(r"http\S+", "", review)             # remove website links
    review = BeautifulSoup(review, 'lxml').get_text()   # remove HTML tags
    review = decontract(review)                         # expand contractions
    review = re.sub(r"\S*\d\S*", "", review).strip()    # remove words containing digits
    review = re.sub('[^A-Za-z]+', ' ', review)          # replace non-letter characters with a space
    review = review.lower()                             # convert to lower case
    # split() without arguments drops the empty tokens left by leading/trailing spaces
    tokens = [word for word in review.split() if word not in stop_words]  # remove stop words
    tokens = [lemmatizer.lemmatize(token, "v") for token in tokens]       # lemmatize as verbs
    return " ".join(tokens)

df['Text'] = df['Text'].apply(preprocess_text)

In [None]:
df['Text'].head()
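
To make the regex steps easier to follow, here is a trace on one invented sentence using only the standard library (HTML stripping, stop-word removal and lemmatization are left out, and only one contraction rule is applied). It also shows why splitting on any whitespace is safer than splitting on a single space.

```python
import re

# Invented example review, not taken from the dataset.
sample = "Visit http://example.com NOW!! I bought 2 packs, they aren't bad."

step1 = re.sub(r"http\S+", "", sample)             # drop the link
step2 = re.sub(r"n't", " not", step1)              # one contraction rule, for brevity
step3 = re.sub(r"\S*\d\S*", "", step2).strip()     # drop tokens containing digits
step4 = re.sub(r"[^A-Za-z]+", " ", step3).lower()  # letters only, lower case

print(repr(step4))
print(step4.split(" "))  # splitting on a single space keeps an empty trailing token
print(step4.split())     # splitting on any whitespace does not
```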

### 5. Data normalization

#### 5.1 Normalizing the review scores to range from 0 to 4

In [None]:
def normalize(score):
    return score - 1

In [None]:
df["Score"] = df["Score"].apply(normalize)

#### 5.2 Dropping unused columns

In [None]:
df = df.drop(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Time', 'Summary'], axis=1)

#### 5.3 Dropping texts left empty after preprocessing

In [None]:
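# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may silently ignore
df['Text'] = df['Text'].str.strip().replace('', np.nan)
df = df.dropna(subset=['Text'])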

In [None]:
df.head()

#### 5.4 Splitting the data into train and test sets (80/20)

In [None]:
train_df, test_df = train_test_split(df, test_size = 0.2, random_state = 42)
print("Training data size: ", train_df.shape)
print("Test data size: ", test_df.shape)
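
Later in the notebook `model.fit` is called with `validation_split=0.2`, which takes a further slice out of the training set. A back-of-the-envelope sketch of the resulting sizes, using a hypothetical row count and ignoring how each library rounds:

```python
n_samples = 1000   # hypothetical number of rows after cleaning

n_test = n_samples * 20 // 100   # 20% held out by train_test_split
n_train = n_samples - n_test     # rows passed to model.fit
n_val = n_train * 20 // 100      # 20% of the training rows used for validation
n_fit = n_train - n_val          # rows actually used for gradient updates

print(n_test, n_train, n_val, n_fit)
print(f"fit/val/test shares: {n_fit / n_samples:.0%} / {n_val / n_samples:.0%} / {n_test / n_samples:.0%}")
```

So the model effectively learns from about 64% of the cleaned data, with 16% used for validation and 20% for testing.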

### 6. Building the model

#### 6.1 Bidirectional LSTM with an Embedding layer

In [None]:
top_words = 6000
tokenizer = Tokenizer(num_words=top_words)
tokenizer.fit_on_texts(train_df['Text'])
list_tokenized_train = tokenizer.texts_to_sequences(train_df['Text'])

vocab_size = len(tokenizer.word_index) + 1
max_review_length = 100
X_train = pad_sequences(list_tokenized_train, maxlen=max_review_length)
y_train = train_df['Score']
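
By default `pad_sequences` pads and truncates at the start of each sequence, so long reviews keep their last `maxlen` tokens. The sketch below mimics that default behaviour in plain Python; `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate a list of token ids to exactly maxlen items."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]          # keep the last maxlen tokens
    return [value] * (maxlen - len(sequence)) + sequence

print(pad_pre([7, 3, 9], maxlen=5))              # [0, 0, 7, 3, 9]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```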

In [None]:
from numpy import asarray, zeros

# Load the pre-trained GloVe vectors into a word -> vector dictionary
embeddings_dictionary = dict()

with open('/kaggle/input/glove6b100dtxt/glove.6B.100d.txt', encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

embedding_matrix = zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
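
Words that do not appear in GloVe keep an all-zero row in the matrix, so it can be worth checking how much of the vocabulary is covered. A minimal sketch of the same logic on toy data (the GloVe lines, the vocabulary and the 3-dimensional vectors are all invented for the example):

```python
# Two hypothetical lines in the GloVe text format: a word followed by its vector
# (shortened to 3 dimensions here; the real file has 100 per word).
glove_lines = [
    "good 0.1 0.2 0.3",
    "bad -0.4 0.5 -0.6",
]

embeddings = {}
for line in glove_lines:
    word, *values = line.split()
    embeddings[word] = [float(v) for v in values]

# Hypothetical tokenizer vocabulary: word -> index, starting at 1 (0 is reserved for padding).
word_index = {"good": 1, "bad": 2, "yummmy": 3}

dim = 3
matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
missing = []
for word, index in word_index.items():
    vector = embeddings.get(word)
    if vector is None:
        missing.append(word)       # row stays all zeros
    else:
        matrix[index] = vector

coverage = 1 - len(missing) / len(word_index)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```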

In [None]:
model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_review_length))
model.add(Bidirectional(LSTM(100)))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
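
The parameter counts printed by `model.summary()` can be checked by hand. The sketch below works through the arithmetic for this architecture; `vocab_size` here is a placeholder value, since the real one depends on the fitted tokenizer.

```python
vocab_size = 20000   # hypothetical; the real value is len(tokenizer.word_index) + 1
embedding_dim = 100
lstm_units = 100
num_classes = 5

embedding_params = vocab_size * embedding_dim
# An LSTM has 4 gates, each with input weights, recurrent weights and a bias.
lstm_params_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_params_one_direction
# The dense layer sees the concatenated forward and backward states (2 * lstm_units).
dense_params = (2 * lstm_units) * num_classes + num_classes

print(embedding_params, bilstm_params, dense_params)
```

The embedding layer dominates the total, and its size is driven by the full vocabulary rather than by `top_words`, because `vocab_size` is computed from `tokenizer.word_index`.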

In [None]:
from keras.utils import plot_model
plot_model(model, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)

In [None]:
import keras
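# One-hot encode the 0-4 labels to match the 5-unit softmax and categorical_crossentropy
y_train = keras.utils.to_categorical(y_train, num_classes=5)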
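
This is also where shifting the scores from 1-5 to 0-4 pays off: one-hot encoding treats each label as a position in a vector, and positions start at 0. A plain-Python sketch of the encoding, where `one_hot` is a hypothetical stand-in for `to_categorical` on a single label:

```python
def one_hot(label, num_classes=5):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

original_scores = [1, 5, 3]                         # invented example scores
labels = [score - 1 for score in original_scores]   # same shift as normalize()
print([one_hot(label) for label in labels])
```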

#### 6.2 Training the model

In [None]:
history = model.fit(X_train, y_train, batch_size=128, epochs=5, verbose=1, validation_split=0.2)

In [None]:
history2 = model.fit(X_train, y_train, batch_size=32, epochs=5, verbose=1, validation_split=0.2)

In [None]:
history3 = model.fit(X_train, y_train, batch_size=64, epochs=5, verbose=1, validation_split=0.2)

In [None]:
history4 = model.fit(X_train, y_train, batch_size=16, epochs=5, verbose=1, validation_split=0.2)
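
Because the four `fit` calls above reuse the same `model` object, each one resumes from the weights left by the previous call rather than starting fresh. The histories are therefore not independent comparisons of batch size. The sketch below tallies the number of update steps per epoch and the cumulative epochs, using a hypothetical sample count.

```python
import math

n_fit = 640   # hypothetical number of samples left after validation_split
runs = [(128, 5), (32, 5), (64, 5), (16, 5)]   # (batch_size, epochs) for each fit call above

total_epochs = 0
for batch_size, epochs in runs:
    steps_per_epoch = math.ceil(n_fit / batch_size)
    total_epochs += epochs
    print(f"batch_size={batch_size}: {steps_per_epoch} steps per epoch, "
          f"{total_epochs} epochs seen by the model so far")
```

To compare batch sizes on equal terms, one option would be to rebuild and recompile the model before each run.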