https://vgpena.github.io/classifying-tweets-with-keras-and-tensorflow/

The link above shows an example that starts from tweets with a specific label (a sentiment, positive or negative) and works as follows:

1. It generates a training set. The training set is built from complete tweets converted to arrays of a fixed size.
2. Each array (X_train, of size N) has a label that represents the sentiment (y_train).
3. Since every sentence has size N, the input to the neural network has size N and the output has size 2 with softmax activation (because there are two classes).
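
The three steps above can be sketched without any framework. The snippet below is a minimal illustration, assuming a hypothetical three-word vocabulary and made-up output scores; it is not the code from the linked post.

```python
import math

# Hypothetical toy vocabulary: word -> column index (not from the linked post).
vocab = {"good": 0, "bad": 1, "movie": 2}

def encode(text, vocab):
    """Turn a text into a fixed-size 0/1 vector with one slot per vocabulary word."""
    vector = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vector[vocab[word]] = 1
    return vector

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

x = encode("Good movie", vocab)  # input of size N = len(vocab)
probs = softmax([0.2, 1.3])      # made-up scores for (negative, positive)
print(x, probs)
```

Every text maps to a vector of the same length regardless of how many words it has, which is what lets a fixed-size input layer consume it.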

Task:

- Build a review classifier for the IMDB dataset in the data_exercise/ folder.

**Wherever the example imports from "keras.x", replace it with "tensorflow.keras.x".**

In [59]:
# Your code
"""
Instead of
from keras.preprocessing.text import Tokenizer
use
from tensorflow.keras.preprocessing.text import Tokenizer
"""
print()




In [60]:
import pandas as pd
import numpy as np
import tensorflow as tf 
import re 
import string 
import json


import tensorflow.keras.preprocessing.text as kpt
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.models import model_from_json

In [61]:
import pandas as pd

df = pd.read_csv('../exercise/data_exercise/IMDB_Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Label encoding

In [62]:
y = df['sentiment']

In [63]:
le = LabelEncoder()
sentiment_le = le.fit_transform(y)

In [64]:
df["sentiment_le"] = sentiment_le
df.head()

Unnamed: 0,review,sentiment,sentiment_le
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1
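
`LabelEncoder` sorts the distinct labels and numbers them in that order, which is why `negative` maps to 0 and `positive` to 1 in the table above. A rough pure-Python equivalent, for illustration only (the helper name `encode_labels` is hypothetical):

```python
def encode_labels(labels):
    """Map each distinct label to its position in the sorted list of labels."""
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return classes, [mapping[label] for label in labels]

classes, encoded = encode_labels(["positive", "positive", "negative", "positive"])
print(classes)
print(encoded)
```

The position in `classes` is also what later decides the order of the human-readable label list used when printing predictions.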


## Train/test split

In [65]:
X = df["review"]
y = df["sentiment_le"]

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
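
`train_test_split` shuffles by default and, with `test_size=0.2`, holds out 20% of the rows. Because no `random_state` is passed here, the split differs on every run. The sketch below mimics the idea with the standard library and a fixed seed; the helper `simple_split` is hypothetical and does not reproduce scikit-learn's exact shuffling.

```python
import random

def simple_split(items, test_size=0.2, seed=42):
    """Shuffle the indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(items) * test_size)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test

rows = list(range(10))
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which helps when comparing runs of the same model.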

## Removing symbols from the text

In [67]:
# Strip the HTML '<br />' tags and the punctuation.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)  # convert the strings to lowercase
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')  # replace <br /> (HTML line break) with a space
  return tf.strings.regex_replace(stripped_html, '[%s]' % re.escape(string.punctuation), '')  # drop punctuation

# Note: the return values are not assigned, so X_train and X_test keep the raw text;
# the Tokenizer below applies its own lowercasing and punctuation filter.
custom_standardization(input_data=X_train)
custom_standardization(input_data=X_test)

<tf.Tensor: shape=(10000,), dtype=string, numpy=
array([b'endearingly silly anime only six episodes in duration about a hapless delivery boy called kintaro well hes called a delivery boy though he is meant to be in his 20s and the adventures he has on his travels each episode sees him arriving in a new town acquiring a new job developing something of a love interest before each episode ends with him leaving  gently sexist juvenile very immature at times this is the kind of anime that just puts a smile on the face  not one to start with if you are not a fan of anime as this certainly wont convince you about the genre but for those who are already converted this is entertaining fluff',
       b'while a 9 might seem like an unusually high score for such a slight film however compared to the hundreds and hundreds of series detective films from the 1930s and 40s this is among the very best and also compares very favorably to powells later thin man films now this does not mean that the film 
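
For a quick check of what `custom_standardization` does to a single string, the same three steps can be written with Python's `re` module. This is an illustrative stand-in (the name `standardize` is hypothetical), not the TensorFlow implementation.

```python
import re
import string

def standardize(text):
    """Lowercase, replace '<br />' with a space, then remove punctuation."""
    lowercase = text.lower()
    stripped_html = lowercase.replace('<br />', ' ')
    return re.sub('[%s]' % re.escape(string.punctuation), '', stripped_html)

cleaned = standardize("Great!<br />Loved it.")
print(cleaned)
```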

## Building the tokenizer

In [68]:
# Vocabulary size and number of words in a sequence:
max_words = 1000
sequence_length = 100

tokenizer = Tokenizer(num_words=max_words)  # create a new Tokenizer
tokenizer.fit_on_texts(X_train)  # feed our reviews to the Tokenizer
dictionary = tokenizer.word_index  # the Tokenizer exposes a word-to-ID mapping
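
`word_index` assigns IDs by descending word frequency, starting at 1, and it contains every word seen during fitting; `num_words` only takes effect later, when texts are converted to sequences or matrices. A rough sketch of that ranking with `collections.Counter` (the helper `build_word_index` is hypothetical and splits on whitespace only, which is simpler than what the Keras Tokenizer does):

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by how often they occur; the most frequent word gets ID 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() orders by count, highest first; IDs start at 1
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index(["good movie", "bad movie", "good good plot"])
print(word_index)
```

Because frequent words get low IDs, keeping only IDs below `max_words` keeps the most common words of the corpus.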

## Saving the tokenizer dictionary to a JSON file

In [69]:
with open('dictionary.json', 'w') as dictionary_file:
    json.dump(dictionary, dictionary_file)

## Converting the text to arrays with the tokenizer

In [70]:
def convert_text_to_index_array(text):
    # `text_to_word_sequence` lowercases the text, strips punctuation, and splits it into a list of words; it does not pad or truncate, so each review keeps its own length.
    return [dictionary[word] for word in kpt.text_to_word_sequence(text)]

In [71]:
allWordIndices = []
# for each review, change each token to its ID in the Tokenizer's word_index
for text in X_train:
    wordIndices = convert_text_to_index_array(text)
    allWordIndices.append(wordIndices)

In [72]:
# now we have a list of all reviews converted to index arrays. The rows have different lengths,
# so keep it as a list of lists (np.asarray on ragged lists fails on recent NumPy versions).

# create binary bag-of-words matrices out of the indexed reviews
X_train = tokenizer.sequences_to_matrix(allWordIndices, mode='binary')
# treat the labels as categories
y_train = tf.keras.utils.to_categorical(y_train, 2)
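
In `binary` mode, `sequences_to_matrix` produces one row per review with `num_words` columns, where a 1 marks each word ID that appears in the review; IDs at or above `num_words` are skipped. `to_categorical` turns a class number into a one-hot row. A minimal pure-Python sketch of both steps (helper names are hypothetical, and Keras returns float arrays rather than lists of ints):

```python
def to_binary_matrix(sequences, num_words):
    """One row per sequence; column j is 1 if word ID j occurs in that sequence."""
    matrix = [[0] * num_words for _ in sequences]
    for row, sequence in zip(matrix, sequences):
        for word_id in sequence:
            if word_id < num_words:  # IDs outside the vocabulary are dropped
                row[word_id] = 1
    return matrix

def one_hot(labels, num_classes):
    """Turn class numbers into one-hot rows."""
    return [[1 if label == k else 0 for k in range(num_classes)] for label in labels]

X = to_binary_matrix([[1, 3, 3], [2, 7]], num_words=5)
y = one_hot([1, 0], num_classes=2)
print(X)
print(y)
```

Word order and repetition are lost in this encoding: the row only records which vocabulary words are present.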

## Building the model

In [73]:
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
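
A dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and dropout layers add none. The arithmetic for the model above, as a quick sanity check against what `model.summary()` reports:

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

layer_sizes = [1000, 512, 256, 2]  # max_words, then the three Dense layers
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

Most of the parameters sit in the first layer, so `max_words` is the main lever on model size.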

In [74]:
model.compile(loss='categorical_crossentropy',
  optimizer='adam',
  metrics=['accuracy'])
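
With one-hot targets, categorical cross-entropy reduces to the negative log of the probability assigned to the true class. A small worked example with made-up probabilities, ignoring the small clipping Keras applies to avoid `log(0)`:

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy for a single sample with a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

confident = categorical_crossentropy([0, 1], [0.1, 0.9])  # right and sure
unsure = categorical_crossentropy([0, 1], [0.5, 0.5])     # coin flip
wrong = categorical_crossentropy([0, 1], [0.9, 0.1])      # wrong and sure
print(confident, unsure, wrong)
```

The loss grows quickly when the model is confidently wrong, which is what pushes the softmax output toward the correct class during training.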

## Training the model

In [75]:
model.fit(X_train, y_train,
        batch_size=32,  # number of samples per gradient update
        epochs=5,  # number of full passes over the training data. I’ve found 5 to be good in this case; I tried 7, but ended up overfitting.
        verbose=1,
        validation_split=0.1,
        shuffle=True)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x20974026160>
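
The number of batches per epoch follows from the split sizes. The calculation below assumes the dataset has 50,000 rows, which is an inference from the 10,000-row test tensor shown earlier with `test_size=0.2`, not a figure stated in the notebook. Keras takes the `validation_split` fraction from the end of the training data before shuffling.

```python
import math

total_rows = 50_000                   # assumption inferred from the (10000,) test tensor
test_rows = total_rows * 20 // 100    # test_size=0.2
train_rows = total_rows - test_rows
val_rows = train_rows * 10 // 100     # validation_split=0.1
fit_rows = train_rows - val_rows      # rows actually used for gradient updates
steps_per_epoch = math.ceil(fit_rows / 32)  # batch_size=32
print(train_rows, val_rows, fit_rows, steps_per_epoch)
```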

## Saving the model

In [76]:
model_json = model.to_json()
with open('model.json', 'w') as json_file:
    json_file.write(model_json)

model.save_weights('model.h5')

## Loading the model

In [77]:
# read in our saved dictionary
with open('dictionary.json', 'r') as dictionary_file:
    dictionary = json.load(dictionary_file)

In [78]:
def convert_text_to_index_array(text):
    words = kpt.text_to_word_sequence(text)
    wordIndices = []
    for word in words:
        if word in dictionary:
            wordIndices.append(dictionary[word])
        else:
            print("'%s' not in training corpus; ignoring." %(word))
    return wordIndices
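
This version of `convert_text_to_index_array` skips words that are missing from the dictionary instead of raising a `KeyError`. A self-contained illustration with a hypothetical toy dictionary, using `str.split` as a simplified stand-in for `text_to_word_sequence`:

```python
toy_dictionary = {"really": 1, "good": 2, "bad": 3}  # hypothetical word -> ID mapping

def to_indices(text, dictionary):
    """Return the IDs of known words and, separately, the words that were skipped."""
    indices, unknown = [], []
    for word in text.lower().split():
        if word in dictionary:
            indices.append(dictionary[word])
        else:
            unknown.append(word)
    return indices, unknown

indices, unknown = to_indices("Really good soundtrack", toy_dictionary)
print(indices, unknown)
```

Skipped words contribute nothing to the input vector, so a sentence made mostly of unseen words gives the model very little to work with.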

In [79]:
# read in your saved model structure
with open('model.json', 'r') as json_file:
    loaded_model_json = json_file.read()
# and create a model from that
model = model_from_json(loaded_model_json)
# and weight your nodes with your saved values
model.load_weights('model.h5')

## Interactive part

In [80]:
# for human-friendly printing
labels = ['negative', 'positive']

while 1:
    evalSentence = input('Input a sentence to be evaluated, or Enter to quit: ')

    if len(evalSentence) == 0:
        break
 
    # format your input for the neural net
    testArr = convert_text_to_index_array(evalSentence)
  
    input_f = tokenizer.sequences_to_matrix([testArr], mode='binary')

    # predict which bucket your input belongs in
    pred = model.predict(input_f)
    # and print it for the humans
    print(evalSentence)
    print("%s sentiment; %f%% confidence" % (labels[np.argmax(pred)], pred[0][np.argmax(pred)] * 100))

Good movie
positive sentiment; 57.277578% confidence
I hate it
negative sentiment; 63.744241% confidence
I love it
positive sentiment; 96.705806% confidence
Really bad
negative sentiment; 99.362791% confidence
Really good
positive sentiment; 83.517128% confidence
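
`model.predict` returns one row of softmax probabilities per input, and the loop above reports the label at the highest-probability position. The same lookup on a made-up prediction row (the order of `labels` has to match the 0/1 encoding produced by the label encoder):

```python
labels = ['negative', 'positive']
pred_row = [0.03, 0.97]  # made-up softmax output, not a real model prediction

# index of the largest probability, equivalent to np.argmax on a single row
best = max(range(len(pred_row)), key=lambda i: pred_row[i])
message = "%s sentiment; %f%% confidence" % (labels[best], pred_row[best] * 100)
print(message)
```

The reported percentage is the softmax probability of the winning class, so with two classes it never drops below 50%.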
