# Sentiment Analysis 😃😢🤬

[Reference](https://colab.research.google.com/drive/1ysEKrw_LE2jMndo1snrZUh5w87LQsCxk#forceEdit=true&sandboxMode=true&scrollTo=Fo3WY-e86zX2)

### 🍿Movie Review Dataset:
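This dataset contains 25,000 IMDB reviews, each already preprocessed and labeled as either positive or negative. Each review is encoded as a sequence of integers, where each integer represents how common a word is in the entire dataset. For example, in the dataset's word index the integer 3 denotes the 3rd most common word. Note that `imdb.load_data` by default reserves the lowest integers (0 for padding, 1 for the start of a sequence, 2 for out-of-vocabulary words) and shifts the word ranks by 3, so the values in the loaded arrays are offset from the raw ranks.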
 




In [15]:
from keras.datasets import imdb
import keras
import tensorflow as tf
import os
import numpy as np

In [16]:
VOCAB_SIZE = 88584
BATCH_SIZE = 64

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = VOCAB_SIZE)

As the dataset description says:

> Each $y \in Y$ is a list of natural numbers: $\mathrm{type}(y) = \mathrm{list}\langle\mathbb{N}\rangle$.<br>
> Let $y$ = `train_data[1]`. If $y[0] = 2$, then, following that description, the word at `y[0]` is the second most common word in the dataset.


In [17]:
train_data[1][:10]

[1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463]

However, not all reviews have the same length, and the network can only process inputs of a fixed size.<br>
So we preprocess the inputs:

* if a review is longer than 250 words, trim off the extra words
* if a review is shorter than 250 words, add as many 0's as needed to bring its length up to 250.

In [23]:
MAXLEN = 250
train_data = keras.utils.pad_sequences(train_data, MAXLEN)
test_data = keras.utils.pad_sequences(test_data, MAXLEN)

len(train_data[1]) == MAXLEN

True
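
To make the padding and truncation rule concrete, here is a minimal pure-Python sketch of it. It is not the Keras implementation: `pad_or_truncate` is a hypothetical helper written for illustration, and it assumes the `pad_sequences` defaults, under which both padding and truncation happen at the start of the sequence.

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Return a list of exactly `maxlen` integers (maxlen must be positive)."""
    if len(sequence) >= maxlen:
        # Too long: keep only the last `maxlen` items (trim from the front).
        return list(sequence[-maxlen:])
    # Too short: add `pad_value` at the front until the length is `maxlen`.
    return [pad_value] * (maxlen - len(sequence)) + list(sequence)


short_review = [1, 194, 1153]
long_review = list(range(1, 11))

print(pad_or_truncate(short_review, 5))  # [0, 0, 1, 194, 1153]
print(pad_or_truncate(long_review, 5))   # [6, 7, 8, 9, 10]
```

With `maxlen=250`, every review ends up with the same length, which is what the check `len(train_data[1]) == MAXLEN` confirms.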

### Building the model 🤖

#### RNNs vs LSTM

> **Simple Recurrent Neural Networks (simpleRNN)**: A recurrent layer processes words or inputs one at a time, in combination with the output from the previous iteration. This allows it to treat words differently based on their order in a sentence and to slowly build an understanding of the entire input, one word at a time.
It processes one word at a time while maintaining an internal memory of what it has already seen.

>**Long Short-Term Memory (LSTM)**: This layer works very similarly to the simpleRNN layer but adds a cell state, a long-term memory controlled by gates that decide what to keep, what to forget and what to output. This lets information from much earlier timesteps influence the current step, rather than relying only on the most recent output.
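
The "one word at a time, with memory" idea can be sketched with a toy scalar recurrence. This is only an illustration, not the Keras `SimpleRNN` layer: the weights `w_x`, `w_h` and `b` are made-up constants rather than learned values, and real layers work on vectors instead of single numbers.

```python
import math


def simple_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar recurrent cell over `inputs` and return the final state."""
    h = 0.0  # internal memory, empty before the first input
    for x in inputs:
        # The new state mixes the current input with the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
    return h


# The same values in a different order lead to different final states.
print(simple_rnn([1.0, 0.0]))
print(simple_rnn([0.0, 1.0]))
```

Because each step depends on the previous state, swapping the order of the inputs changes the result, which is how a recurrent layer can treat words differently depending on their position.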
 

In [24]:
OUTPUT_DIM_EMBEDDING = 32 # dimension of the vectors generated by the embedding layer

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, OUTPUT_DIM_EMBEDDING),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
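
As a sanity check on the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard layer layouts (an embedding table, an LSTM with four gate blocks that each have input weights, recurrent weights and a bias, and a dense layer with a bias); the variable names are chosen for illustration only.

```python
VOCAB_SIZE = 88584
EMBEDDING_DIM = 32
LSTM_UNITS = 32

# One EMBEDDING_DIM-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# Four gate blocks, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((EMBEDDING_DIM + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)

# One output unit: a weight per LSTM unit plus a bias.
dense_params = LSTM_UNITS * 1 + 1

total_params = embedding_params + lstm_params + dense_params

print(embedding_params)  # 2834688
print(lstm_params)       # 8320
print(dense_params)      # 33
print(total_params)      # 2843041
```

Almost all of the parameters sit in the embedding layer, since it stores one vector for each of the 88,584 vocabulary entries.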


### Compiling and training the model

In [25]:
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["acc"])
history = model.fit(train_data, train_labels, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
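
For reference, the `binary_crossentropy` loss used in `model.compile` can be sketched in plain Python. This is an illustrative re-implementation of the formula, not the Keras code: the function name, the clipping constant `eps` and the example labels and probabilities are all assumptions made for the sketch.

```python
import math


def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip to avoid log(0) for fully confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


labels = [1, 0]
print(binary_crossentropy(labels, [0.9, 0.1]))  # confident and correct: low loss
print(binary_crossentropy(labels, [0.1, 0.9]))  # confident and wrong: high loss
```

The loss grows quickly when the model assigns a high probability to the wrong class, which is why a model can have reasonable accuracy and still show a fairly large loss.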


### Evaluating the model

In [27]:
test_loss, test_acc = model.evaluate(test_data, test_labels)
print('\nTest accuracy:', test_acc)
print('\nTest loss:', test_loss)


Test accuracy: 0.85343998670578

Test loss: 0.5242786407470703


### Predictions

In [46]:
word_index = imdb.get_word_index()

def encode_text(text):
    # Turn the text into a sequence of word indices
    # based on the IMDB word index
    tokens = keras.preprocessing.text.text_to_word_sequence(text)
    # Words missing from the index are mapped to 0.
    # Caveat: imdb.load_data shifts indices by index_from=3 by default,
    # while these raw word_index values are not shifted.
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]
    return keras.utils.pad_sequences([tokens], MAXLEN)[0]

text = "that movie was just amazing, so amazing"
encoded = encode_text(text)
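
The lookup step can be tried without downloading the IMDB word index. The sketch below uses a tiny made-up dictionary, `TOY_WORD_INDEX`, and a regex tokenizer that only approximates `text_to_word_sequence` (lowercasing and dropping punctuation); both are assumptions for illustration.

```python
import re

# Hypothetical miniature word index; the real one comes from imdb.get_word_index().
TOY_WORD_INDEX = {"the": 1, "movie": 2, "was": 3, "amazing": 4}


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def encode_tokens(text, word_index, unknown=0):
    """Map each word to its index, using `unknown` for words not in the index."""
    return [word_index.get(word, unknown) for word in tokenize(text)]


print(encode_tokens("That movie was just amazing, so amazing", TOY_WORD_INDEX))
# [0, 2, 3, 0, 4, 0, 4]
```

Words that are missing from the index ("that", "just", "so" here) become 0, the same value used for padding, so the model cannot distinguish them from padding.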

In [41]:
def predict(text):
    # encode the text
    encoded = encode_text(text)
    # build a valid input batch X of shape (1, MAXLEN) from the encoded text
    X = np.zeros((1, MAXLEN))
    X[0] = encoded
    # predict
    result = model.predict(X)
    print(result[0])

In [42]:
positive_review = "That movie was! really loved it and would great watch it again because it was amazingly great"
predict(positive_review)

negative_review = "that movie really sucked. I hated it and wouldn't watch it again. Was one of the worst things I've ever watched"
predict(negative_review)

[0.9774961]
[0.46915933]
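
Since the last layer is a sigmoid, each output is a score between 0 and 1. One common way to read it, assumed here rather than stated in the notebook, is to treat scores of 0.5 or more as positive and anything below as negative. `to_label` is a hypothetical helper for that rule.

```python
def to_label(probability, threshold=0.5):
    """Turn a sigmoid score into a sentiment label using a fixed threshold."""
    return "positive" if probability >= threshold else "negative"


for score in (0.9774961, 0.46915933):
    print(score, to_label(score))
# 0.9774961 positive
# 0.46915933 negative
```

Under this reading the first review is clearly positive, while the second falls only just below the threshold, so the model is far less certain about it.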


### Experiments

In [49]:
L1 = tf.keras.layers.Embedding(VOCAB_SIZE, 32)
A1 = L1(encoded)
print(A1.shape)

(250, 32)
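
The `(250, 32)` shape follows from what an embedding layer does: it looks up one 32-dimensional vector per token. A minimal sketch of that lookup in plain Python is shown below, with a small made-up vocabulary of 100 entries and random values standing in for trained weights; the helper names are hypothetical.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Replace each token id with its row in the embedding table."""
    return [table[token_id] for token_id in token_ids]


table = make_embedding_table(vocab_size=100, dim=32)
sequence = [0] * 243 + [12, 7, 45, 3, 99, 7, 1]  # padded to 250 tokens
vectors = embed(sequence, table)

print(len(vectors), len(vectors[0]))  # 250 32
```

Repeated tokens, such as the two 7s above, map to the same vector, and every padding 0 maps to row 0 of the table.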
