# Sentiment analysis (IMDb)

In [1]:
import pandas as pd

import numpy
# Not needed here (the data is read from CSV files below):
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.preprocessing import sequence

import random

Using TensorFlow backend.


## Data preparation and encoding

In [2]:
data_train = pd.read_csv('imdb_train.csv')
data_test = pd.read_csv('imdb_test.csv')

In [3]:
print(data_train.Review[1])
print(data_train.Eval[1])

Okay, I know I shouldn't like this movie but I do. From Pat Morita's loveable interpretation of a Japanese stereotype to Jay Leno's annoying yell, I laughed throughout this movie.As long as you take into account that this is not the best movie in the world, it's a good mvie.<br /><br />My favorite part is Morita talking to his boss in Tokyo with the drinking a close second.
1


We keep the untransformed reviews in R_train so that we can use them later to check the predictions

In [4]:
R_train = data_train.Review
Y_train = data_train.Eval

We use the Keras *tokenizer* to convert the reviews into sequences of integers (similar to encoding with *one_hot*)  
We keep the 5000 most frequent words (*top_words*)

In [None]:
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

top_words = 5000
tokenizer = Tokenizer(num_words=top_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(R_train)
X_train = tokenizer.texts_to_sequences(R_train)

In [None]:
print(R_train[1])
print(X_train[1])
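
As a rough illustration of the encoding step, the pure-Python sketch below mimics what the tokenizer does: rank words by frequency, give the most frequent word index 1, and drop every word whose index is not below `num_words`. The helper names `fit_word_index` and `encode` and the three toy reviews are assumptions made for this example. The sketch only lowercases and splits on whitespace, whereas the real `Tokenizer` also strips the characters listed in `filters`.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by decreasing count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(texts, word_index, num_words):
    """Replace words by their index, skipping words with index >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


# Hypothetical toy corpus, not taken from the IMDb files.
texts = ["a good movie", "a bad movie", "a good good plot"]
word_index = fit_word_index(texts)
sequences = encode(texts, word_index, num_words=4)
print(word_index)
print(sequences)
```

With `num_words=4` only the words ranked 1 to 3 survive, which is why the rarer words disappear from the encoded sequences.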

We transform the test data in the same way

In [None]:
R_test = data_test.Review
Y_test = data_test.Eval
X_test = tokenizer.texts_to_sequences(R_test)

We pad the sequences with 0s on the right (`padding='post'`) so that they all have the same length (here 500); longer sequences are truncated to that length

In [None]:
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words, padding='post')
X_test = sequence.pad_sequences(X_test, maxlen=max_words, padding='post')

In [None]:
print(X_train[1])
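
To make the padding step concrete, here is a small NumPy sketch of what `pad_sequences(..., padding='post')` is expected to produce. The helper `pad_post` is hypothetical, and it assumes the Keras default `truncating='pre'`, meaning that a sequence longer than `maxlen` loses its first elements.

```python
import numpy as np


def pad_post(seqs, maxlen):
    """Right-pad with zeros; drop leading items of sequences that are too long."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for i, seq in enumerate(seqs):
        kept = seq[-maxlen:]          # assumed truncating='pre' behaviour
        out[i, :len(kept)] = kept     # zeros remain on the right
    return out


# Hypothetical toy sequences.
padded = pad_post([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(padded)
```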

## Dense networks

We start with a simple dense model

In [None]:
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
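
The parameter counts that `model.summary()` should report for this architecture can be checked by hand. The arithmetic below is a sketch that assumes the usual conventions: one weight per vocabulary entry and embedding dimension, and one bias per dense unit.

```python
top_words, embedding_dim, max_words, hidden_units = 5000, 32, 500, 250

# Embedding: one vector of size embedding_dim per vocabulary index.
embedding_params = top_words * embedding_dim

# Flatten: no parameters, but it fixes the input size of the dense layer.
flatten_size = max_words * embedding_dim

# Dense layers: weights plus one bias per unit.
dense_params = flatten_size * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total_params)
```

Almost all of the parameters sit in the first dense layer, because flattening turns each review into a vector of 16000 values.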

In [None]:
# Training
model.fit(X_train, Y_train, epochs=2, batch_size=128, verbose=2)
# Evaluation
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

We compute the predictions

In [None]:
Y_mlp = model.predict(X_test)

### Test  
if the prediction is close to 0: negative review  
if the prediction is close to 1: positive review

In [None]:
ir = random.randrange(X_test.shape[0])  # valid index in [0, n-1]; randint(1, n) could return n
print(R_test[ir])
print(Y_mlp[ir])
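
The rule above can be turned into hard labels by thresholding the sigmoid output. The sketch below assumes a threshold of 0.5, which is a conventional choice rather than something fixed by the notebook. The arrays `probs` and `y_true` are hypothetical values shaped like the output of `model.predict` and like the label column.

```python
import numpy as np

# Hypothetical predictions, shape (n, 1) as returned by model.predict.
probs = np.array([[0.03], [0.48], [0.51], [0.97]])
# Hypothetical true labels (0 = negative, 1 = positive).
y_true = np.array([0, 1, 1, 1])

# Flatten to shape (n,) and apply the assumed 0.5 threshold.
labels = (probs.ravel() >= 0.5).astype(int)
accuracy = (labels == y_true).mean()
print(labels, accuracy)
```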

## Convolutional networks

In [None]:
from keras.layers import Conv1D
from keras.layers import MaxPooling1D
from keras.layers import Dropout

We use a one-dimensional convolutional layer

In [None]:
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Conv1D(filters=32, kernel_size=10, activation="relu"))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
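
To see what the one-dimensional convolution does to the embedded sequence, the NumPy sketch below slides a window of 10 time steps over a sequence of 500 embedding vectors, with no padding. The random input and weights are placeholders, not trained values, and the loop is written for readability rather than speed.

```python
import numpy as np

max_words, embedding_dim, filters, kernel_size = 500, 32, 32, 10

rng = np.random.default_rng(0)
x = rng.normal(size=(max_words, embedding_dim))                 # one embedded review
kernel = rng.normal(size=(kernel_size, embedding_dim, filters))  # placeholder weights
bias = np.zeros(filters)

# Without padding, the window fits in max_words - kernel_size + 1 positions.
steps = max_words - kernel_size + 1
out = np.stack([
    # Sum over the window's time steps and embedding channels for each filter.
    np.tensordot(x[t:t + kernel_size], kernel, axes=([0, 1], [0, 1])) + bias
    for t in range(steps)
])
out = np.maximum(out, 0.0)  # ReLU activation
print(out.shape)
```

Each filter therefore looks at 10 consecutive words at a time, and the `Flatten` layer that follows receives 491 x 32 values.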

In [None]:
# Training
model.fit(X_train, Y_train, epochs=10, batch_size=128, verbose=2)
# Evaluation
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [None]:
Y_cnn1 = model.predict(X_test)

In [None]:
ir = random.randrange(X_test.shape[0])  # valid index in [0, n-1]; randint(1, n) could return n
print(R_test[ir])
print(Y_cnn1[ir])

With a slightly more complex network:

In [None]:
model = Sequential()
model.add(Embedding(top_words, 128, input_length=max_words))
model.add(Conv1D(filters=128, kernel_size=10, activation="relu"))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
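
The sequence length shrinks at every convolution and pooling layer, and it can help to trace it before reading the summary. The helpers `conv_out` and `pool_out` below are hypothetical and assume the Keras defaults used here: no padding, convolution stride 1, and a pooling stride equal to `pool_size`.

```python
def conv_out(length, kernel_size):
    """Output length of an unpadded convolution with stride 1."""
    return length - kernel_size + 1


def pool_out(length, pool_size):
    """Output length of unpadded max pooling with stride = pool_size."""
    return (length - pool_size) // pool_size + 1


length = 500                      # max_words
length = conv_out(length, 10)     # first Conv1D
after_conv1 = length
length = pool_out(length, 2)      # first MaxPooling1D
after_pool1 = length
length = conv_out(length, 3)      # second Conv1D
after_conv2 = length
length = pool_out(length, 2)      # second MaxPooling1D
after_pool2 = length

flatten_size = after_pool2 * 64   # 64 filters in the last Conv1D
print(after_conv1, after_pool1, after_conv2, after_pool2, flatten_size)
```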

In [None]:
# Training
model.fit(X_train, Y_train, epochs=10, batch_size=128, verbose=2)
# Evaluation
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

In [None]:
Y_cnn2 = model.predict(X_test)

In [None]:
ir = random.randrange(X_test.shape[0])  # valid index in [0, n-1]; randint(1, n) could return n
print(R_test[ir])
print(Y_cnn2[ir])

## Recurrent networks (LSTM)

In [None]:
from keras.layers import LSTM

In [None]:
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
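
As a sanity check on the summary, the LSTM parameter count can be derived from the standard formulation: four gates, each with an input weight matrix, a recurrent weight matrix and a bias vector. The sketch below assumes that standard layout with a single bias per gate.

```python
top_words, embedding_dim, units = 5000, 32, 128

embedding_params = top_words * embedding_dim

# 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * (units * (embedding_dim + units) + units)

# Final sigmoid unit reads the last hidden state of size `units`.
output_params = units * 1 + 1

total_params = embedding_params + lstm_params + output_params
print(embedding_params, lstm_params, output_params, total_params)
```

Unlike the dense model, the parameter count does not depend on the sequence length of 500, because the same LSTM weights are reused at every time step.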

Warning: training with LSTMs can take quite a long time.

In [None]:
# Training
model.fit(X_train, Y_train, epochs=10, batch_size=128, verbose=2)
# Evaluation
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

On installations with *Tensorflow GPU*, one can use *CuDNNLSTM*, which is much faster:

In [None]:
from keras.layers import CuDNNLSTM, BatchNormalization

In [None]:
model = Sequential()
model.add(Embedding(top_words, 100, input_length=max_words))
model.add(CuDNNLSTM(128))
model.add(BatchNormalization())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

In [None]:
# Training
model.fit(X_train, Y_train, epochs=10, batch_size=128, verbose=2)
# Evaluation
scores = model.evaluate(X_test, Y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

## Exercise

Try the different approaches on the Reuters newswire dataset  
https://keras.io/datasets/#reuters-newswire-topics-classification  
  
  
The task is to classify each newswire into one of 46 topics

from keras.datasets import reuters  
top_words = 30000  
(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=top_words)

Loading the dataset directly means the tokenization/encoding step is not needed

**Hint**: here the classification is over multiple categories (slightly different from the 0/1 decision of the IMDb example). You can draw on the *mnist* example
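
To illustrate what changes in the multi-class setting, the NumPy sketch below one-hot encodes integer labels over 46 classes, applies a softmax to some scores and computes the categorical cross-entropy. The labels and scores are hypothetical placeholders and the helper names are assumptions. In Keras terms this would typically correspond to a final `Dense(46, activation='softmax')` layer trained with the `categorical_crossentropy` loss, but that choice is left to the exercise.

```python
import numpy as np

num_classes = 46


def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1 at the label position."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


labels = np.array([3, 45, 0])                 # hypothetical topic ids
targets = one_hot(labels, num_classes)

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, num_classes))    # placeholder network outputs
probs = softmax(scores)

# Categorical cross-entropy averaged over the three examples.
loss = -np.mean(np.sum(targets * np.log(probs), axis=1))
predicted = probs.argmax(axis=1)              # most probable topic per example
print(loss, predicted)
```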