### IMDB movie review sentiment classification

In this project we analyze movie review data and build a model that predicts the sentiment of a review.

### Setting up the development environment

In [81]:
import numpy as np
import matplotlib.pyplot as plt

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Embedding, LSTM, Dense, Input, Bidirectional
from keras.regularizers import l2

### Loading the data

In [82]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

We make sure that all inputs have the same length, because the neural network expects input of a uniform shape. First, we look at the average review length.

In [83]:
lengths = [len(x) for x in x_train]
avg_length = sum(lengths) / len(lengths)
print(avg_length)

238.71364


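The average review length is 238.71 words. As the standardized length we take a value slightly below that, 200 words. With its default settings, pad_sequences pads shorter reviews with zeros at the front and truncates longer reviews from the front, so only the last 200 tokens of a long review are kept.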

In [84]:
x_train = sequence.pad_sequences(x_train, maxlen=200)
x_test = sequence.pad_sequences(x_test, maxlen=200)
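
To make the padding step concrete, here is a minimal pure-Python sketch of the default behaviour of pad_sequences (padding and truncating both at the front). The helper name `pad_pre` and the toy sequences are hypothetical and exist only for this illustration; the notebook itself uses the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Pad on the left and truncate from the left (the pad_sequences defaults)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# Hypothetical toy sequences, not taken from the IMDB data
short = pad_pre([[11, 12, 13]], maxlen=5)
long = pad_pre([[1, 2, 3, 4, 5, 6]], maxlen=4)
print(short)  # [[0, 0, 11, 12, 13]]
print(long)   # [[3, 4, 5, 6]]
```

Because truncation happens at the front, a long review keeps its ending rather than its beginning.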

If we look at a sample review, we see that it is a list of integers. The original data was a sentence, but it has been converted into a list of integers in which each word is replaced by an index based on the word's frequency rank in the dataset.

In [85]:
print(x_train[0])

[   5   25  100   43  838  112   50  670    2    9   35  480  284    5
  150    4  172  112  167    2  336  385   39    4  172 4536 1111   17
  546   38   13  447    4  192   50   16    6  147 2025   19   14   22
    4 1920 4613  469    4   22   71   87   12   16   43  530   38   76
   15   13 1247    4   22   17  515   17   12   16  626   18    2    5
   62  386   12    8  316    8  106    5    4 2223 5244   16  480   66
 3785   33    4  130   12   16   38  619    5   25  124   51   36  135
   48   25 1415   33    6   22   12  215   28   77   52    5   14  407
   16   82    2    8    4  107  117 5952   15  256    4    2    7 3766
    5  723   36   71   43  530  476   26  400  317   46    7    4    2
 1029   13  104   88    4  381   15  297   98   32 2071   56   26  141
    6  194 7486   18    4  226   22   21  134  476   26  480    5  144
   30 5535   18   51   36   28  224   92   25  104    4  226   65   16
   38 1334   88   12   16  283    5   16 4472  113  103   32   15   16
 5345 

We fetch the word index. It is a dictionary of the form \{word: rank\}, so we need to invert it in order to map ranks back to words.

In [86]:
word_index = imdb.get_word_index()

reverse_word_index = {value: key for key, value in word_index.items()}

The entries in the list do not correspond directly to the rank; instead, the entry value minus 3 corresponds to the rank. The reason is that the value 0 denotes \<PAD\>, 1 denotes \<START\> and 2 denotes \<UNKNOWN\>.

In [87]:
decoded_sentence = ' '.join([reverse_word_index.get(i-3, '?') for i in x_train[0]])
print(decoded_sentence)

and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
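
The decoding step can be illustrated on a tiny example. The sketch below assumes a hypothetical four-word vocabulary in the same \{word: rank\} form and, unlike the cell above, prints the three reserved values by name instead of as '?'. The names `toy_word_index`, `SPECIAL` and `decode` are illustrative only.

```python
# Hypothetical toy vocabulary in the same {word: rank} form as the word index
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}
toy_reverse = {rank: word for word, rank in toy_word_index.items()}

# Reserved values described in the text
SPECIAL = {0: "<PAD>", 1: "<START>", 2: "<UNKNOWN>"}

def decode(encoded, reverse_index, offset=3):
    """Turn a list of encoded integers back into a readable string."""
    words = []
    for i in encoded:
        if i in SPECIAL:
            words.append(SPECIAL[i])
        else:
            # every real word is stored as rank + offset
            words.append(reverse_index.get(i - offset, "?"))
    return " ".join(words)

decoded = decode([0, 1, 4, 5, 6, 2, 7], toy_reverse)
print(decoded)  # <PAD> <START> the film was <UNKNOWN> great
```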


We can see that some words are missing. The reason is that we use num_words=10000, which restricts us to the 10 000 most frequent words. If we raise that number to, say, 20 000, fewer words are replaced by the unknown token.

In [88]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)

x_train = sequence.pad_sequences(x_train, maxlen=200)
x_test = sequence.pad_sequences(x_test, maxlen=200)

decoded_sentence = ' '.join([reverse_word_index.get(i-3, '?') for i in x_train[0]])
print(decoded_sentence)

and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that w
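
One rough way to quantify the effect of the vocabulary size is to measure the share of unknown tokens in an encoded review. The sketch below is only an illustration: the helper `unknown_fraction` and the toy review are hypothetical, and it assumes that 2 marks unknown words and 0 marks padding, as described earlier.

```python
def unknown_fraction(encoded, oov_char=2, pad_char=0):
    """Share of non-padding tokens that were replaced by the unknown marker."""
    tokens = [i for i in encoded if i != pad_char]
    if not tokens:
        return 0.0
    return sum(1 for i in tokens if i == oov_char) / len(tokens)

# Hypothetical padded review: 10 real tokens, 2 of them unknown
toy_review = [0, 0, 5, 25, 2, 43, 2, 112, 50, 670, 9, 35]
frac = unknown_fraction(toy_review)
print(frac)  # 0.2
```

Averaging this value over x_train for both vocabulary sizes would show how many tokens the larger vocabulary recovers.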

### Building the model

The model is a recurrent neural network (RNN) with a bidirectional LSTM (Long Short-Term Memory) layer.

In [90]:
def rnn(x_train, y_train, x_test, y_test):
  x_input = Input(shape=(200,))

  # maps each word index to a dense vector that can capture semantic relationships
  x = Embedding(20000, 128)(x_input)

  # captures long-range dependencies by processing the sequence in both directions
  x = Bidirectional(LSTM(128, dropout=0.3, recurrent_dropout=0.3))(x)

  # outputs the probability of the positive class (label 1)
  x = Dense(1, activation='sigmoid', kernel_regularizer='l2')(x)

  # build the model from the input and output tensors
  model = Model(inputs=x_input, outputs=x)

  # print the model summary
  model.summary()

  # configure the loss, optimizer and metric
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

  # train the model; the test set doubles as validation data here
  model.fit(x_train, y_train, batch_size=64, epochs=5, verbose=1, validation_data=(x_test, y_test))
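
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand from the layer sizes used above. The sketch below is plain arithmetic and assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights and a bias) and that the two directions are concatenated before the dense layer; the variable names are illustrative.

```python
vocab_size, embed_dim, lstm_units = 20000, 128, 128

# Embedding: one vector of size embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_params  # forward and backward LSTM

# Dense(1) on the concatenated forward and backward states (2 * lstm_units inputs)
dense_params = 2 * lstm_units * 1 + 1

total_params = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total_params)
# 2560000 263168 257 2823425
```

Most of the parameters sit in the embedding layer, so the vocabulary size dominates the size of the model.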



Running training and evaluation on the two sentiment classes: negative and positive.

In [57]:
rnn(x_train, y_train, x_test, y_test)

Epoch 1/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 199s 502ms/step - accuracy: 0.6459 - loss: 0.6236 - val_accuracy: 0.8246 - val_loss: 0.4141
Epoch 2/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 189s 484ms/step - accuracy: 0.8616 - loss: 0.3553 - val_accuracy: 0.8431 - val_loss: 0.3950
Epoch 3/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 189s 484ms/step - accuracy: 0.8759 - loss: 0.3196 - val_accuracy: 0.8344 - val_loss: 0.4330
Epoch 4/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 190s 486ms/step - accuracy: 0.8917 - loss: 0.2883 - val_accuracy: 0.8502 - val_loss: 0.3823
Epoch 5/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 188s 481ms/step - accuracy: 0.9211 - loss: 0.2241 - val_accuracy: 0.8543 - val_loss: 0.3955
Epoch 6/10
391/391 ━━━━━━━━━━━━━━━━━━━━ 195s 499ms/step - accuracy: 0.9339 - loss: 0.1995 - val_accuracy: 0.7866 - val_loss: 0.5282
Epoc

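In the run above we observe overfitting: training accuracy keeps rising while validation accuracy falls to 0.7866 in epoch 6. We therefore reduce the number of epochs from 10 to 5.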
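
A cutoff can also be read off the logged validation loss instead of being picked by eye. The following sketch uses the val_loss values of the first six epochs from the log above and shows two simple rules: the epoch with the lowest validation loss, and patience-based early stopping. The helpers `best_epoch` and `stop_epoch` are hypothetical and written only for this illustration.

```python
# val_loss values copied from the first six epochs of the log above
val_loss = [0.4141, 0.3950, 0.4330, 0.3823, 0.3955, 0.5282]

def best_epoch(losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1

def stop_epoch(losses, patience):
    """Return the 1-based epoch at which patience-based early stopping halts."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(losses)

best = best_epoch(val_loss)
stopped = stop_epoch(val_loss, patience=2)
print(best, stopped)  # 4 6
```

In Keras the same idea is available as the EarlyStopping callback, which could be passed to model.fit instead of fixing the epoch count in advance.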

In [91]:
rnn(x_train, y_train, x_test, y_test)

Epoch 1/5
391/391 ━━━━━━━━━━━━━━━━━━━━ 197s 494ms/step - accuracy: 0.6898 - loss: 0.5917 - val_accuracy: 0.8173 - val_loss: 0.4281
Epoch 2/5
391/391 ━━━━━━━━━━━━━━━━━━━━ 202s 517ms/step - accuracy: 0.8545 - loss: 0.3638 - val_accuracy: 0.8387 - val_loss: 0.4139
Epoch 3/5
391/391 ━━━━━━━━━━━━━━━━━━━━ 194s 496ms/step - accuracy: 0.8817 - loss: 0.3092 - val_accuracy: 0.8397 - val_loss: 0.3938
Epoch 4/5
391/391 ━━━━━━━━━━━━━━━━━━━━ 194s 497ms/step - accuracy: 0.8955 - loss: 0.2854 - val_accuracy: 0.8438 - val_loss: 0.3920
Epoch 5/5
391/391 ━━━━━━━━━━━━━━━━━━━━ 187s 478ms/step - accuracy: 0.9146 - loss: 0.2409 - val_accuracy: 0.8489 - val_loss: 0.4165


The output shows that the model reaches an accuracy of 84.89% on the test data after the final epoch, which is a good result.

### Model without the 20 most frequent words

We will train a model that has no access to the 20 most frequent words. The idea is to leave out the most frequent words during training, because they usually carry little information about sentiment.

We load the data.

In [92]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000, skip_top=20, oov_char=2)

The omitted words are:

In [93]:
print([reverse_word_index[i] for i in range(1, 21)])

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'on']


We can see that these 20 words indeed carry little information about the sentiment of a sentence.

In [94]:
x_train = sequence.pad_sequences(x_train, maxlen=200)
x_test = sequence.pad_sequences(x_test, maxlen=200)

In [96]:
rnn(x_train, y_train, x_test, y_test)

Epoch 1/5
391/391 ━━━━━━━━━━━━━━━━━━━━ 190s 477ms/step - accuracy: 0.6538 - loss: 0.6162 - val_accuracy: 0.8430 - val_loss: 0.3861
Epoch 2/5
391/391 ━━━━━━━━━━━━━━━━━━━━ 211s 539ms/step - accuracy: 0.8314 - loss: 0.3959 - val_accuracy: 0.7026 - val_loss: 0.5907
Epoch 3/5
391/391 ━━━━━━━━━━━━━━━━━━━━ 186s 477ms/step - accuracy: 0.8378 - loss: 0.4028 - val_accuracy: 0.8472 - val_loss: 0.3709
Epoch 4/5
391/391 ━━━━━━━━━━━━━━━━━━━━ 185s 475ms/step - accuracy: 0.9085 - loss: 0.2528 - val_accuracy: 0.8608 - val_loss: 0.3469
Epoch 5/5
391/391 ━━━━━━━━━━━━━━━━━━━━ 191s 488ms/step - accuracy: 0.9370 - loss: 0.1866 - val_accuracy: 0.8649 - val_loss: 0.3599


By removing the 20 most frequent words we obtained better results in this run. Although validation performance dropped unexpectedly in the second epoch (val_accuracy 0.7026), the sentiment classifier ended with an accuracy of 86.49% on the test data, compared with 84.89% for the model that sees all words, which is a very good result.
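
To put the difference between the two runs into perspective, the sketch below converts the two final accuracies into a count of additionally correct reviews and a rough standard error. It assumes the standard IMDB test split of 25 000 reviews and, as a simplification, treats the two estimates as independent even though both are measured on the same test set. It also ignores run-to-run variation from random initialization and dropout, so it is only an indication, not a significance test.

```python
import math

n_test = 25000          # size of the IMDB test split
acc_baseline = 0.8489   # final val_accuracy with all words
acc_filtered = 0.8649   # final val_accuracy with skip_top=20

diff = acc_filtered - acc_baseline
extra_correct = round(diff * n_test)

# Binomial standard error of each accuracy estimate
se_baseline = math.sqrt(acc_baseline * (1 - acc_baseline) / n_test)
se_filtered = math.sqrt(acc_filtered * (1 - acc_filtered) / n_test)
# Simplification: combine the two as if they were independent estimates
se_diff = math.sqrt(se_baseline**2 + se_filtered**2)

print(f"{diff:.4f} {extra_correct} {se_diff:.4f}")
# 0.0160 400 0.0031
```

The gap of 1.6 percentage points corresponds to about 400 reviews; repeating both runs with several random seeds would show how much of it is due to training noise.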