In [113]:
import pandas as pd
df = pd.read_csv('datasets/sentiment labelled sentences/yelp_labelled.txt', names=['sentence','label'],sep='\t')

In [114]:
df.head()

                                            sentence  label
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1


In [115]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  1000 non-null   object
 1   label     1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


### Quick check: total number of words in the dataset

In [116]:
cek = df['sentence'].str.split().apply(len).tolist()
cek = pd.DataFrame(cek, columns=['words_count'])
cek['words_count'].sum()

10894

In [117]:
from sklearn.model_selection import train_test_split
kalimat = df['sentence'].values
y = df['label'].values
kalimat_train, kalimat_test, y_train, y_test = train_test_split(kalimat, y, test_size=0.2)

For the model to understand text, we must tokenize it first. Apply the tokenizer to both the training data and the test data, and use pad_sequences so that every sequence has the same length.
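To make the tokenization step concrete, here is a minimal pure-Python sketch of the idea behind a word-level tokenizer: lowercase the text, split it into words, give the most frequent words the smallest indices, and map unknown or rare words to an out-of-vocabulary (OOV) index. The helper names `fit_vocab` and `texts_to_ids`, the `<oov>` marker, and the simplified regular expression are assumptions made for illustration; this is not the Keras implementation.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")  # simplified word pattern (assumption)

def fit_vocab(texts, oov_token="<oov>"):
    """Build a word -> index map. Index 0 is left free for padding, 1 is OOV."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    vocab = {oov_token: 1}
    # Most frequent words receive the smallest indices.
    for word, _ in counts.most_common():
        vocab[word] = len(vocab) + 1
    return vocab

def texts_to_ids(texts, vocab, num_words=None, oov_token="<oov>"):
    """Turn each text into a list of indices; unknown or rare words become OOV."""
    oov = vocab[oov_token]
    result = []
    for text in texts:
        ids = []
        for word in WORD_RE.findall(text.lower()):
            idx = vocab.get(word, oov)
            if num_words is not None and idx >= num_words:
                idx = oov  # word is outside the kept vocabulary
            ids.append(idx)
        result.append(ids)
    return result

train = ["Crust is not good.", "Not tasty and the texture was just nasty."]
vocab = fit_vocab(train)
print(texts_to_ids(["Crust was not great"], vocab))
```

The word "great" never appears in the two training sentences, so it is mapped to the OOV index, which is the role `oov_token` plays in the notebook's tokenizer.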

In [118]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [119]:
kalimat_train[1]

'My wife had the Lobster Bisque soup which was lukewarm.'

### Convert sentences to tokens

In [120]:
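# Limit the vocabulary to word indices below 750; other words map to the OOV token 'x'.
tokenizer = Tokenizer(num_words=750, oov_token='x')
tokenizer.fit_on_texts(kalimat_train)
# Note: fitting on the test sentences as well leaks test vocabulary into the
# tokenizer; for a stricter evaluation, fit on the training split only.
tokenizer.fit_on_texts(kalimat_test)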

### Convert tokens to sequences

In [121]:
train_sequences = tokenizer.texts_to_sequences(kalimat_train)
test_sequences = tokenizer.texts_to_sequences(kalimat_test)

### Quick check: maximum number of words per row/sequence in the dataset

In [122]:
# Length of the longest training sequence
maks = max(len(seq) for seq in train_sequences)
print(maks)

32


In [123]:
train_sequences[1]

[21, 431, 23, 2, 432, 588, 351, 74, 5, 589]

The sequences have different lengths, so we pad them to make them all the same length.

In [124]:
padded_train = pad_sequences(train_sequences, maxlen=32)
padded_test = pad_sequences(test_sequences, maxlen=32)

In [125]:
padded_train[1]

array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,  21, 431,  23,   2,
       432, 588, 351,  74,   5, 589], dtype=int32)
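The leading zeros in the output above come from padding at the front of each sequence, which is the default behaviour of pad_sequences. A minimal pure-Python sketch of that behaviour, assuming a positive `maxlen`, is shown here; `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and, if needed, left-truncate) each sequence to maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[21, 431, 23], [1, 2, 3, 4, 5, 6]], maxlen=5))
```

Sequences longer than `maxlen` lose tokens from the front. This matters here because the maximum length of 32 was measured on the training sequences only, so a longer test sentence would be truncated.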

Every sequence now has the same length.

In [126]:
for n in padded_train[1:5]:
    length = len(n)
    print(length)

32
32
32
32


### Build the Model with an Embedding Layer

The architecture starts with an Embedding layer whose first argument matches the vocabulary size (number of words) used in the tokenizer. The next arguments are the embedding dimension and input_length, which is the length of each sequence. Instead of a Flatten layer we use GlobalAveragePooling1D, which averages the embedding vectors over the sequence and often works better than Flatten in NLP cases like this one.
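One way to see why average pooling can help on a small dataset is to count the weights of the Dense layer that follows it. The sketch below uses the sizes from this notebook (sequence length 32, embedding dimension 27, 24 dense units); `dense_params` and `global_average_pool` are hypothetical helpers for illustration, not Keras functions.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def global_average_pool(sequence):
    """Average a list of equal-length vectors position by position."""
    n = len(sequence)
    return [sum(column) / n for column in zip(*sequence)]

seq_len, embed_dim, units = 32, 27, 24

# Flatten feeds all 32 * 27 values into the Dense layer.
flatten_params = dense_params(seq_len * embed_dim, units)
# Global average pooling feeds only one 27-dimensional vector.
pooled_params = dense_params(embed_dim, units)

print(flatten_params, pooled_params)
print(global_average_pool([[1.0, 2.0], [3.0, 4.0]]))
```

With pooling, the first Dense layer needs far fewer weights, which leaves less room to overfit 800 training sentences. The trade-off is that averaging discards word order.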

Choosing the embedding dimension: it should be small enough to avoid overfitting, but large enough to capture the complexity of the vocabulary. One rule of thumb sets the dimensionality of the embedding vector proportional to the square root of the vocabulary size.
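Applied to this notebook's vocabulary of 750 words, with a proportionality constant of 1 (an assumption), the square-root rule works out as follows:

```python
import math

vocab_size = 750  # num_words passed to the tokenizer
embedding_dim = round(math.sqrt(vocab_size))  # constant of 1 assumed
print(embedding_dim)
```

This gives the embedding dimension of 27 used in the model. Treat it as a starting point to tune rather than a fixed rule.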

In [127]:
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(750, 27, input_length=32),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [128]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
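For reference, binary cross-entropy averages the negative log-probability the model assigns to the correct label. The pure-Python sketch below is an illustration, not the Keras implementation; the function name and the clipping constant `eps` are assumptions. It shows that a model predicting 0.5 for every sentence scores ln 2, about 0.693, which is consistent with the loss of 0.6930 reported for the first epoch of the training run.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood for binary labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
uninformed = binary_crossentropy(labels, [0.5] * 4)
confident = binary_crossentropy(labels, [0.9, 0.1, 0.9, 0.1])
print(round(uninformed, 4), round(confident, 4))
```

The loss falls as the predicted probabilities move towards the correct labels.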

### Train model

In [129]:
history = model.fit(padded_train, y_train, epochs=30,
                    validation_data=(padded_test, y_test), 
                    verbose=2)

Epoch 1/30
25/25 - 2s - loss: 0.6930 - accuracy: 0.5475 - val_loss: 0.6920 - val_accuracy: 0.6650 - 2s/epoch - 65ms/step
Epoch 2/30
25/25 - 0s - loss: 0.6909 - accuracy: 0.6488 - val_loss: 0.6902 - val_accuracy: 0.7100 - 111ms/epoch - 4ms/step
Epoch 3/30
25/25 - 0s - loss: 0.6878 - accuracy: 0.7088 - val_loss: 0.6875 - val_accuracy: 0.6650 - 163ms/epoch - 7ms/step
Epoch 4/30
25/25 - 0s - loss: 0.6820 - accuracy: 0.7387 - val_loss: 0.6806 - val_accuracy: 0.7000 - 108ms/epoch - 4ms/step
Epoch 5/30
25/25 - 0s - loss: 0.6720 - accuracy: 0.7287 - val_loss: 0.6695 - val_accuracy: 0.6950 - 169ms/epoch - 7ms/step
Epoch 6/30
25/25 - 0s - loss: 0.6553 - accuracy: 0.7487 - val_loss: 0.6552 - val_accuracy: 0.7800 - 132ms/epoch - 5ms/step
Epoch 7/30
25/25 - 0s - loss: 0.6312 - accuracy: 0.7962 - val_loss: 0.6311 - val_accuracy: 0.7500 - 203ms/epoch - 8ms/step
Epoch 8/30
25/25 - 0s - loss: 0.5987 - accuracy: 0.8025 - val_loss: 0.6078 - val_accuracy: 0.8000 - 175ms/epoch - 7ms/step
Epoch 9/30
25/25 -