# Atividade 4 - NLP

O objetivo da segunda atividade é implementa conceitos do processamento de linguagem natural, para isso foi utlizado um dataset da amazon, no qual possui a review do usuário sobre um produto e a avalição de usuário. 

## Preparando o Ambiente

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
import tensorflow as tf

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

O target desse problema era uma nota entre 0 e 5. Para simplificar o problema o target foi reduzido para 0 e 1, com 0 sendo a avalição negativa (rating<4) e 1 sendo a avaliação positiva (rating>=4).

In [2]:
df = pd.read_csv('/kaggle/input/amazon-music-reviews/Musical_instruments_reviews.csv')
df.dropna(axis=0)
train, test = train_test_split(df, test_size = 0.1, random_state=42)
train_x = train['reviewText']
train_y = train['overall']
test_x = test['reviewText']
test_y = test['overall']

def good_or_bad(rating):
    if rating in [4,5]:
        return 1
    else:
        return 0
    
train_y = train_y.apply(good_or_bad)
test_y = test_y.apply(good_or_bad)

## Pré-processamento

Em seguida realizamos o pré-processamento dos dados, nessa etapa removemos as stopwords e aplicamos lowercasing em todos os textos.

In [3]:
import nltk
import re
nltk.download('stopwords')

def pre_processamento(data):
    df = pd.DataFrame(columns=['text'])
    for texto in data:
        try:
            letras_min =  re.findall(r'\b[A-zÀ-úü]+\b', texto.lower())

            stopwords = nltk.corpus.stopwords.words('english')
            stop = set(stopwords)
            sem_stopwords = [w for w in letras_min if w not in stop]

            texto_limpo = f'{" ".join(sem_stopwords)}'

            nova_linha = pd.Series([texto_limpo], index=['text'])
            df = pd.concat([df, nova_linha.to_frame().T], ignore_index=True)
        except:
            nova_linha = pd.Series(['nan'], index=['text'])
            df = pd.concat([df, nova_linha.to_frame().T], ignore_index=True)
        
    return df

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
train_x = pre_processamento(train_x)

test_x = pre_processamento(test_x)

In [5]:
print(train_x.head())
print(test_x.head())

                                                text
0  work well nice low profile easy push button tr...
1  loss initial tone pedal priority list love sou...
2  friend mine guitar repair genus budding luthie...
3  works great without polish good size ziplock b...
4  given violin fiddle friend decided would never...
                                                text
0  using acoustic guitars straps decade sturdy co...
1  sounds like great concept seem well made care ...
2  recently ordered wide variety picks find ones ...
3  two stands electric guitar bass version acoust...
4  guitar sounds awesome stays tune well fan taka...


In [6]:
vocab_size = 15000
embedding_dim = 16
max_length = 1081

## Aplicando a Tokenização

Nessa etapa convertemos os textos para números por meio da tokenização, assim é possível aplicar o dado no modelo.

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size,oov_token="<OOV>")
tokenizer.fit_on_texts(train_x['text'])

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(train_x['text'])
training_padded = pad_sequences(training_sequences, padding='post', maxlen=max_length,truncating='post')

test_sequences = tokenizer.texts_to_sequences(test_x['text'])
test_padded = pad_sequences(test_sequences, padding='post',maxlen=max_length,truncating='post')

In [7]:
from tensorflow.keras.metrics import Recall, Precision

## Criando o Modelo

Agora realizamos a criação de um modelo simples.

In [9]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=[Recall(), 'accuracy', Precision()])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1081, 16)          240000    
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 24)                408       
                                                                 
 dense_1 (Dense)             (None, 1)                 25        
                                                                 
Total params: 240,433
Trainable params: 240,433
Non-trainable params: 0
_________________________________________________________________


In [10]:
history = model.fit(training_padded, train_y, epochs=10, validation_data=(test_padded, test_y), verbose=2)

Epoch 1/10
289/289 - 4s - loss: 0.4497 - recall: 0.9909 - accuracy: 0.8731 - precision: 0.8799 - val_loss: 0.3653 - val_recall: 1.0000 - val_accuracy: 0.8802 - val_precision: 0.8802 - 4s/epoch - 14ms/step
Epoch 2/10
289/289 - 2s - loss: 0.3674 - recall: 1.0000 - accuracy: 0.8791 - precision: 0.8791 - val_loss: 0.3654 - val_recall: 1.0000 - val_accuracy: 0.8802 - val_precision: 0.8802 - 2s/epoch - 8ms/step
Epoch 3/10
289/289 - 2s - loss: 0.3668 - recall: 1.0000 - accuracy: 0.8791 - precision: 0.8791 - val_loss: 0.3643 - val_recall: 1.0000 - val_accuracy: 0.8802 - val_precision: 0.8802 - 2s/epoch - 7ms/step
Epoch 4/10
289/289 - 2s - loss: 0.3659 - recall: 1.0000 - accuracy: 0.8791 - precision: 0.8791 - val_loss: 0.3640 - val_recall: 1.0000 - val_accuracy: 0.8802 - val_precision: 0.8802 - 2s/epoch - 8ms/step
Epoch 5/10
289/289 - 2s - loss: 0.3651 - recall: 1.0000 - accuracy: 0.8791 - precision: 0.8791 - val_loss: 0.3628 - val_recall: 1.0000 - val_accuracy: 0.8802 - val_precision: 0.8802 -

Esse modelo acabou tendo 88% de acuracia no dado de teste.

----

## Aplicando o LSTM

Agora utilizamos um modelo de LSTM bidirecional para tentar obter um melhor resultado.

In [8]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size,64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=[Recall(), 'accuracy', Precision()])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          960000    
                                                                 
 bidirectional (Bidirectiona  (None, 128)              66048     
 l)                                                              
                                                                 
 dense (Dense)               (None, 64)                8256      
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 1,034,369
Trainable params: 1,034,369
Non-trainable params: 0
_________________________________________________________________


In [12]:
history = model.fit(training_padded, train_y, epochs=10, validation_data=(test_padded, test_y), verbose=2)

Epoch 1/10
289/289 - 79s - loss: 0.3576 - recall: 0.9963 - accuracy: 0.8778 - precision: 0.8805 - val_loss: 0.3006 - val_recall: 0.9956 - val_accuracy: 0.8851 - val_precision: 0.8876 - 79s/epoch - 274ms/step
Epoch 2/10
289/289 - 49s - loss: 0.2218 - recall: 0.9766 - accuracy: 0.9185 - precision: 0.9337 - val_loss: 0.2998 - val_recall: 0.9768 - val_accuracy: 0.8929 - val_precision: 0.9084 - 49s/epoch - 168ms/step
Epoch 3/10
289/289 - 38s - loss: 0.1246 - recall: 0.9852 - accuracy: 0.9579 - precision: 0.9675 - val_loss: 0.3761 - val_recall: 0.9823 - val_accuracy: 0.8870 - val_precision: 0.8988 - 38s/epoch - 132ms/step
Epoch 4/10
289/289 - 35s - loss: 0.0701 - recall: 0.9915 - accuracy: 0.9784 - precision: 0.9841 - val_loss: 0.4162 - val_recall: 0.9381 - val_accuracy: 0.8647 - val_precision: 0.9108 - 35s/epoch - 120ms/step
Epoch 5/10
289/289 - 30s - loss: 0.0374 - recall: 0.9956 - accuracy: 0.9885 - precision: 0.9914 - val_loss: 0.5298 - val_recall: 0.9325 - val_accuracy: 0.8608 - val_pre

Com esse modelo consegimos ótimos resultados, a acurácia no treinamento chegou em 99% e no teste chegou em 90%