https://vgpena.github.io/classifying-tweets-with-keras-and-tensorflow/

The link above walks through an example of how a classifier is built from tweets that carry a specific label (a sentiment, positive or negative):

1. It builds a training set. The training set is made from complete tweets, each converted to an array of a fixed size.
2. Each array (X_train, of size N) has a label that represents its sentiment (y_train).
3. Since every sentence has size N, the neural network's input has size N and its output has size 2, with softmax activation (because there are two classes).
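
As a rough illustration of the three steps above, the following plain-Python sketch encodes a sentence as a fixed-size binary vector and turns two raw output scores into class probabilities with softmax. The toy vocabulary, the helper names `encode` and `softmax`, and the example scores are assumptions made for illustration; they do not come from the linked post.

```python
import math

# Hypothetical toy vocabulary: each known word maps to one position of the input vector.
vocab = {"good": 0, "bad": 1, "movie": 2, "great": 3, "awful": 4}
N = len(vocab)  # size of the network input

def encode(text):
    """Return a binary vector of length N marking which vocabulary words occur in text."""
    vector = [0] * N
    for word in text.lower().split():
        if word in vocab:
            vector[vocab[word]] = 1
    return vector

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

x = encode("A great movie")   # fixed-size input of length N
probs = softmax([0.3, 1.2])   # two made-up output scores, one per class
predicted_class = probs.index(max(probs))
```

Whatever the length of the sentence, `x` always has `N` entries, which is what lets a dense network with a fixed input size consume it.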

Task:

- Build a review classifier for the IMDB dataset in the data_exercise/ folder.

**When an import uses "keras.x", replace it with "tensorflow.keras.x".**

In [101]:
import pandas as pd
import numpy as np
import re
import string
import tensorflow as tf

from sklearn.preprocessing import LabelEncoder

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D, Dropout, Activation
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [2]:
df = pd.read_csv('data_exercise/IMDB Dataset.csv')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [18]:
# 80/20 split by row order: first 80% for training, the rest for testing
train = df.iloc[:int(len(df)*0.8)]
test = df.iloc[int(len(df)*0.8):]

In [19]:
print(len(train))
len(test)

40000


10000

In [43]:
X_train = np.array(train.review)
y_train = np.array(train.sentiment)
X_test = np.array(test.review)
y_test = np.array(test.sentiment)

In [41]:
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')

# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set maximum_sequence length as all samples are not of the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)
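
To make the effect of the layer more concrete, here is a plain-Python sketch that mimics the same pipeline by hand: standardize, split on whitespace, map words to integers, then pad or truncate to a fixed length. It is not the TensorFlow implementation; the helper names and the tiny vocabulary are hypothetical. It follows the convention that index 0 is used for padding and index 1 for out-of-vocabulary words.

```python
import re
import string

def standardize(text):
    """Lowercase, replace '<br />' with a space, and drop punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def to_int_sequence(text, word_to_id, length):
    """Map words to ids (1 for unknown words), then truncate or pad with 0 to `length`."""
    ids = [word_to_id.get(word, 1) for word in standardize(text).split()]
    ids = ids[:length]
    return ids + [0] * (length - len(ids))

# Hypothetical vocabulary; ids 0 and 1 are reserved for padding and unknown words.
word_to_id = {"a": 2, "wonderful": 3, "production": 4}

sequence = to_int_sequence("A wonderful little production. <br /><br />The", word_to_id, 8)
```

In the real layer, the vocabulary is learned from the training text when `adapt` is called, rather than written by hand as it is here.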

In [71]:
vectorize_layer.adapt(X_train)

In [84]:
embedding_dim=16

model = Sequential([
  vectorize_layer,
  Embedding(vocab_size, embedding_dim, name="embedding"),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(1)  # no activation: the loss is configured with from_logits=True, so it expects raw logits
])

In [47]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [85]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
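
The loss above is configured with `from_logits=True`, which means it expects the model's last layer to emit a raw score (a logit) rather than a probability. The plain-Python sketch below shows, under that assumption, how a logit relates to a probability, how binary cross-entropy can be computed directly from it, and why thresholding the logit at 0 is the same as thresholding the probability at 0.5. The function names and the example value are illustrative only.

```python
import math

def sigmoid(logit):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def bce_from_logit(logit, label):
    """Binary cross-entropy computed directly from a logit (numerically stable form)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

logit = 1.5                                 # made-up model output for one review
probability = sigmoid(logit)                # probability of the class encoded as 1
loss_if_positive = bce_from_logit(logit, 1)
predicted_label = int(logit > 0)            # equivalent to probability > 0.5
```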

In [52]:
le = LabelEncoder()
# Fit the encoder on the training labels, then reuse the same mapping for the test labels
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [86]:
model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=15,
    callbacks=[tensorboard_callback])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x167148d8448>

In [87]:
prediction = model.predict(X_test)

In [92]:
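# Threshold the raw outputs at 0 to get integer class ids (assigning .dtype would only reinterpret the float bytes)
prediction = (prediction > 0).astype(int)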

In [97]:
le.inverse_transform(prediction[0])[0]

'negative'

In [96]:
le.inverse_transform(y_test)[0]

'negative'

In [98]:
X_test[0]

'First off I want to say that I lean liberal on the political scale and I found the movie offensive. I managed to watch the whole doggone disgrace of a film . This movie brings a low to original ideas. Yes it was original thus my 2 stars instead of 1. Are our film writers that uncreative that they can only come up with this?? Acting was horrible , and the characters were unlikeable for the most part. The lead lady in the story had no good qualities at all. They made her bf into some sort of a bad guy and I did not see that at all. Maybe I missed something , I do not know.He was the most down to earth, relevant character in the movie. I did not shell out any money for this garbage. I almost wish PETA would come to the rescue of this awful, offensive movie and form a protest. DISGUSTING thats all I have to say anymore !'

In [100]:
model.save('my_custom_model.tf')

INFO:tensorflow:Assets written to: my_custom_model.tf\assets


## The Alternative Approach

In [105]:
import warnings

warnings.filterwarnings('ignore')

In [109]:
df = pd.read_csv('data_exercise/IMDB Dataset.csv')

X_train, y_train = np.array(df.review), np.array(df.sentiment.apply(lambda x: 1 if x == "positive" else 0))

In [113]:
# Keep only the most relevant (most frequent) words
max_words = 3000

# Prepare the tokenizer and its word-to-index dictionary
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)
dictionary = tokenizer.word_index

# Convert each review in X_train to a list of word indices
def convert_text_to_index_array(text):
    return [dictionary[word] for word in tf.keras.preprocessing.text.text_to_word_sequence(text)]

allWordIndices = []
for text in X_train:
    wordIndices = convert_text_to_index_array(text)
    allWordIndices.append(wordIndices)

# Keep allWordIndices as a list of lists: the reviews have different lengths,
# so recent NumPy versions refuse to build a regular array from them.

# Binary bag-of-words matrix train_x (indices >= max_words are ignored)
train_x = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')

# Convert y_train to a one-hot (categorical) array
train_y = tf.keras.utils.to_categorical(y_train, 2)
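
For readers unfamiliar with the two conversions used in this cell, the following plain-Python sketch reproduces their effect on a tiny made-up input. It is an illustration, not the Keras implementation; `sequences_to_binary_matrix` and `one_hot` are hypothetical helper names.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """One row per sequence; column j is 1 if word index j occurs in that sequence."""
    matrix = []
    for seq in sequences:
        row = [0] * num_words
        for index in seq:
            if index < num_words:  # indices outside the kept vocabulary are skipped
                row[index] = 1
        matrix.append(row)
    return matrix

def one_hot(labels, num_classes):
    """Turn integer class ids into one-hot rows, e.g. 1 -> [0, 1]."""
    return [[1 if c == label else 0 for c in range(num_classes)] for label in labels]

rows = sequences_to_binary_matrix([[1, 3, 3], [2, 7]], num_words=5)
targets = one_hot([1, 0], num_classes=2)
```

Each row has exactly `num_words` columns regardless of review length, which matches the `input_shape=(max_words,)` the dense model expects.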

In [114]:
# Build the model

model = Sequential([
    Dense(512, input_shape=(max_words,), activation='relu'),
    Dropout(0.5),
    Dense(256, activation='sigmoid'),
    Dropout(0.5),
    Dense(2, activation='softmax')
])

In [115]:
model.compile(loss='categorical_crossentropy',
  optimizer='adam',
  metrics= ["CategoricalAccuracy"])

In [116]:
hist_model = model.fit(train_x, train_y,
  batch_size=32,
  epochs=5,
  verbose=1,
  validation_split=0.1,
  shuffle=True)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
