1. You can pick one of the models. Please feel free to use pytorch/tensorflow tutorials or any 
existing source codes. There is a restriction to choosing a topic. The model and source code 
you pick should be different from your final project topic. It could be related but the source 
code should be different. 

IMDB Movie Review Sentiment Analysis.

The IMDB Large Movie Review Dataset consists of 50,000 English-language movie reviews, each labeled 1 (positive) or 0 (negative). Of these, 25,000 are set aside as training data and the remaining 25,000 as test data.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

In [None]:
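# load IMDB dataset, keeping only the 5,000 most frequent words
# (rarer words are replaced by the out-of-vocabulary id 2)
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=5000)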

In [None]:
# check the data
print(x_train[0])
print(y_train[0])
print(len(x_train[0]))

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]
1
218


In [None]:
# load word_index (word -> frequency rank, starting at 1)
word_to_index = keras.datasets.imdb.get_word_index()
index_to_word = {index: word for word, index in word_to_index.items()}

In [None]:
index_to_word[1]

'the'
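
Note that the ids inside `x_train` are not the raw ranks from `get_word_index()`: with its default arguments, `load_data()` shifts every rank by 3 (`index_from=3`) and reserves 0 for padding, 1 for the start marker, and 2 for out-of-vocabulary words. The following is a minimal sketch of decoding a review under that assumption. The helper name `decode_review`, the display names in `SPECIAL_TOKENS`, and the toy dictionary are illustrative and not part of the original notebook.

```python
# Sketch: map IMDB token ids back to words, assuming the load_data() defaults
# (index_from=3; ids 0, 1, 2 reserved for padding, start, and out-of-vocabulary).
SPECIAL_TOKENS = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}  # hypothetical display names


def decode_review(token_ids, index_to_word, index_from=3):
    """Turn a list of token ids into a readable string."""
    words = []
    for token_id in token_ids:
        if token_id in SPECIAL_TOKENS:
            words.append(SPECIAL_TOKENS[token_id])
        else:
            # Undo the offset before looking up the frequency rank.
            words.append(index_to_word.get(token_id - index_from, "<UNK>"))
    return " ".join(words)


# Toy stand-in for the real index_to_word, so the example runs without a download.
toy_index_to_word = {1: "the", 11: "this", 19: "film"}
decoded = decode_review([1, 14, 22, 2, 4], toy_index_to_word)
print(decoded)
```

With the notebook's own variables, the call would be `decode_review(x_train[0], index_to_word)`, run before the padding step.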

In [None]:
# pad (or truncate) every review to exactly 500 tokens
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=500)

x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=500)
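
By default, `pad_sequences` pads and truncates at the front (`padding='pre'`, `truncating='pre'`): short reviews get zeros before the first token, and long reviews lose their beginning. The NumPy sketch below mimics that default behavior on toy data to make it visible. The function name `pad_pre` is hypothetical, and this is not the Keras implementation itself.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """NumPy sketch of pad_sequences with its default 'pre' padding and truncation."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]             # keep only the last maxlen tokens
        padded[row, -len(kept):] = kept  # right-align; padding fills the front
    return padded


toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9]]
result = pad_pre(toy_reviews, maxlen=4)
print(result)
```

Front padding is usually a reasonable choice for an LSTM or GRU that only returns its last state, because the real tokens then sit closest to the output.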

In [None]:
# build and train model
vocab_size = 5000      # Size of vocabulary
word_vector_dim = 300  # Word vector dimension

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, word_vector_dim, input_shape=(None,)))
model.add(keras.layers.LSTM(120))   
model.add(keras.layers.Dense(1, activation='sigmoid')) # positive or negative

model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])

epochs = 10

history = model.fit(x_train,
                    y_train,
                    epochs=epochs,
                    batch_size=64,
                    validation_split=0.2,
                    verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


It seems difficult to learn good word vectors from a short sentiment classification run alone; this amount of training data is generally considered too small for that. Therefore, let's use Word2Vec, a pretrained word embedding model provided by Google.

In [None]:
from gensim.models import KeyedVectors

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz'
word2vec = KeyedVectors.load_word2vec_format(word2vec_path, binary=True, limit=None)
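
Pretrained vectors are useful because words with related meanings tend to point in similar directions, which is usually measured with cosine similarity. The sketch below shows the computation on made-up 3-dimensional toy vectors. They are illustrative only and are not taken from the GoogleNews model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up toy vectors, for illustration only.
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "boring": [-0.7, 0.3, 0.2],
}
sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["boring"])
print(sim_close, sim_far)
```

With the loaded model, the same quantity should be available through gensim as `word2vec.similarity('good', 'great')`.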

In [None]:
# embedding layer settings
vocab_size = 5000      # Size of vocabulary
word_vector_dim = 300  # Word vector dimension (matches the 300-dimensional Word2Vec vectors)

In [None]:
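# copy Word2Vec vectors into embedding_matrix
# load_data() shifts every word index by 3 (0: padding, 1: start, 2: unknown),
# so token id i corresponds to index_to_word[i - 3]
embedding_matrix = np.random.rand(vocab_size, word_vector_dim)
for i in range(4, vocab_size):
    word = index_to_word[i - 3]
    if word in word2vec:
        embedding_matrix[i] = word2vec[word]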
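
It can be worth checking how many vocabulary words actually receive a pretrained vector, since rows that are not found keep their random initialization. The sketch below wraps the copy step in a function that also counts the hits. It assumes the default `load_data()` offset of 3, and the function name `build_embedding_matrix`, the `seed` parameter, and the toy dictionaries are hypothetical stand-ins for the notebook's `index_to_word` and `word2vec`.

```python
import numpy as np


def build_embedding_matrix(index_to_word, vectors, vocab_size, dim, index_from=3, seed=0):
    """Copy pretrained vectors into a (vocab_size, dim) matrix and count the hits."""
    rng = np.random.default_rng(seed)
    matrix = rng.random((vocab_size, dim))  # rows without a pretrained vector stay random
    hits = 0
    for token_id in range(index_from + 1, vocab_size):
        # Undo the offset applied by load_data() to recover the frequency rank.
        word = index_to_word.get(token_id - index_from)
        if word is not None and word in vectors:
            matrix[token_id] = vectors[word]
            hits += 1
    return matrix, hits


# Toy stand-ins so the example runs without the 300-dimensional model.
toy_index_to_word = {1: "the", 2: "and", 3: "zzz"}
toy_vectors = {"the": np.ones(4), "and": np.full(4, 2.0)}
matrix, hits = build_embedding_matrix(toy_index_to_word, toy_vectors, vocab_size=7, dim=4)
print(hits, "of", 7 - 4, "words found")
```

With the real objects, the call would look like `build_embedding_matrix(index_to_word, word2vec, vocab_size, word_vector_dim)`.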

In [None]:
# train model

from tensorflow.keras.initializers import Constant

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 
                                 word_vector_dim, 
                                 embeddings_initializer=Constant(embedding_matrix),  # use the copied embeddings here
                                 input_length=500, 
                                 trainable=True))  # trainable=True fine-tunes the embeddings
model.add(keras.layers.LSTM(128))
model.add(keras.layers.Dense(1,activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

epochs = 10

history = model.fit(x_train,
                    y_train,
                    epochs=epochs,
                    batch_size=64,
                    validation_split=0.2,
                    verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
# train with GRU
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 
                                 word_vector_dim, 
                                 embeddings_initializer=Constant(embedding_matrix),  # use the copied embeddings here
                                 input_length=500, 
                                 trainable=True))  # trainable=True fine-tunes the embeddings
model.add(keras.layers.GRU(128))
model.add(keras.layers.Dense(1,activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

epochs = 5

history = model.fit(x_train,
                    y_train,
                    epochs=epochs,
                    batch_size=64,
                    validation_split=0.2,
                    verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
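
To compare the LSTM and GRU runs, one option is to look at the best validation accuracy stored in `history.history`, the dictionary that `model.fit` fills in. The sketch below assumes the key `val_accuracy`, which is what `metrics=['accuracy']` combined with `validation_split` should produce in TensorFlow 2. The helper name `best_epoch` is hypothetical, and the numbers are made-up placeholders, not results from this notebook.

```python
def best_epoch(history_dict, metric="val_accuracy"):
    """Return (1-based epoch number, value) of the highest recorded metric."""
    values = history_dict[metric]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Placeholder values for illustration only; NOT results from this notebook.
placeholder_history = {"val_accuracy": [0.50, 0.70, 0.65]}
epoch, value = best_epoch(placeholder_history)
print(epoch, value)
```

With the notebook's variables, the call would be `best_epoch(history.history)` after each `model.fit`. The held-out test set can then be scored with `model.evaluate(x_test, y_test)`.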


(a) Code should be run already by you. At the end of the code, please describe what you’ve
learned in this practice or any observations (at least three sentences).

This was my first time working on NLP, and I wanted to use the LSTM I learned about in class. First, while implementing the model, I learned how text is fed into a deep learning model: how to handle reviews of different lengths and how to handle words the model has not seen before. Second, it was an opportunity to study the GRU in addition to the LSTM covered in class. Building on this experience, I will try more interesting research by applying NLP to my own work.

(b) Extra bonus: given the code, can you try to improve the performance? Please describe
your strategy and show the results. 

To improve model performance, I used Word2Vec embeddings pretrained on a large-scale dataset. This gives each word a vector that better captures the similarity between words. Next, I replaced the LSTM with a GRU, which further improved model performance.