In [1]:
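# Text classification with an RNN
# Make sure the GPU option is enabled! (Edit - Notebook settings - Hardware accelerator (GPU))

'''
1. import: import the required modules.
2. Preprocessing: preprocess the data needed for training.
3. Modeling (model): define the model.
4. Compile (compile): configure the model for training (optimizer, loss, metrics).
5. Training (fit): train the model.
'''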


In [2]:
# NLP Question
'''
For this task you will build a classifier for the sarcasm dataset. The classifier should have a final layer with 1 neuron activated by sigmoid, as shown.

It will be tested against a number of sentences that the network hasn't previously seen,
and you will be scored on whether sarcasm was correctly detected in those sentences.
'''

In [3]:
import json
import tensorflow as tf
import numpy as np
import urllib.request  # urlretrieve lives in the urllib.request submodule

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import ModelCheckpoint

In [4]:
url = 'https://storage.googleapis.com/download.tensorflow.org/data/sarcasm.json'
urllib.request.urlretrieve(url, 'sarcasm.json')

('sarcasm.json', <http.client.HTTPMessage at 0x7f800e061b50>)

In [5]:
with open('sarcasm.json') as f:
  datas = json.load(f)

In [6]:
'''
Print the first 5 entries of datas
- article_link: URL of the news article
- headline: headline of the news article
- is_sarcastic: whether the headline is sarcastic (sarcastic: 1, not sarcastic: 0)
'''
datas[:5]

[{'article_link': 'https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5',
  'headline': "former versace store clerk sues over secret 'black code' for minority shoppers",
  'is_sarcastic': 0},
 {'article_link': 'https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365',
  'headline': "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
  'is_sarcastic': 0},
 {'article_link': 'https://local.theonion.com/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697',
  'headline': "mom starting to fear son's web series closest thing she will have to grandchild",
  'is_sarcastic': 1},
 {'article_link': 'https://politics.theonion.com/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302',
  'headline': 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas',
  'is_sarcastic': 1},
 {'article_link': 'https://www.huffingtonpost.com/entry/jk-rowling-w

In [7]:
# X (Feature): sentences, Y (Label): label
sentences = []
labels = []

for data in datas:
    sentences.append(data['headline'])
    labels.append(data['is_sarcastic'])

In [8]:
sentences[:5]

["former versace store clerk sues over secret 'black code' for minority shoppers",
 "the 'roseanne' revival catches up to our thorny political mood, for better and worse",
 "mom starting to fear son's web series closest thing she will have to grandchild",
 'boehner just wants wife to listen, not come up with alternative debt-reduction ideas',
 'j.k. rowling wishes snape happy birthday in the most magical way']

In [9]:
labels[:5]

[0, 0, 1, 1, 0]

In [11]:
training_size = 20000

train_sentences = sentences[:training_size]
train_labels = labels[:training_size]

validation_sentences = sentences[training_size:]
validation_labels = labels[training_size:]

In [12]:
# OOV -> out of vocabulary
# The hyperparameters below are usually set explicitly.
# num_words: maximum vocabulary size; the most frequent words are kept first.
# oov_token: the token that stands in for words not in the vocabulary.

vocab_size = 1000
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

In [13]:
tokenizer.fit_on_texts(train_sentences)
print(train_sentences[4])

j.k. rowling wishes snape happy birthday in the most magical way


In [14]:
for key, value in tokenizer.word_index.items():
    print('{}  \t======>\t {}'.format(key, value))
    if value == 25:
        break



In [16]:
len(tokenizer.word_index)

25637

In [17]:
word_index = tokenizer.word_index

In [18]:
# texts_to_sequences: converts each sentence into a sequence of integers. Apply it separately to the training set and the validation set.
train_sequences = tokenizer.texts_to_sequences(train_sentences)
validation_sequences = tokenizer.texts_to_sequences(validation_sentences)

In [19]:
train_sentences[4]

'j.k. rowling wishes snape happy birthday in the most magical way'

In [20]:
word_index['j'], word_index['k'], word_index['rowling'], word_index['wishes'], word_index['snape'], word_index['happy']

(715, 672, 5652, 1043, 8865, 662)

In [21]:
train_sequences[4]

[715, 672, 1, 1, 1, 662, 553, 5, 4, 92, 1, 90]
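
The sequence above contains several 1s because `num_words=1000` keeps only words whose index is below 1000; every other word is replaced by the OOV index, which is 1 here. A minimal sketch of that filtering in plain Python follows. The helper `to_sequence` is a hypothetical illustration, not the Keras implementation, and the indices are copied from the `word_index` lookup above.

```python
# Illustrative re-implementation of the index filtering; not the Keras source.
def to_sequence(words, word_index, num_words, oov_index=1):
    """Map words to indices, replacing unknown or too-rare words with oov_index."""
    sequence = []
    for word in words:
        index = word_index.get(word)
        if index is None or index >= num_words:
            # The word is unseen or falls outside the num_words most frequent words.
            sequence.append(oov_index)
        else:
            sequence.append(index)
    return sequence

# Indices copied from the word_index lookup shown above.
word_index = {'j': 715, 'k': 672, 'rowling': 5652,
              'wishes': 1043, 'snape': 8865, 'happy': 662}
words = ['j', 'k', 'rowling', 'wishes', 'snape', 'happy']

sequence = to_sequence(words, word_index, num_words=1000)
print(sequence)  # [715, 672, 1, 1, 1, 662]
```

The result matches the first six entries of `train_sequences[4]`: 'rowling', 'wishes', and 'snape' all have indices of 1000 or more, so they collapse to the OOV index.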

In [22]:
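'''
maxlen: maximum sequence length. Longer sequences are truncated to this length.
truncating: whether to cut from the front ('pre') or the back ('post') when a sequence is longer than maxlen.
padding: whether to pad at the front ('pre') or the back ('post') when a sequence is shorter than maxlen.
'''

# Maximum number of words per sentence
max_length = 120

# Where to truncate sequences
trunc_type = 'post'

# Where to pad sequences
padding_type = 'post'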

In [23]:
train_padded = pad_sequences(train_sequences, maxlen=max_length, truncating=trunc_type, padding=padding_type)
validation_padded = pad_sequences(validation_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

In [24]:
train_padded.shape

(20000, 120)
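
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a small plain-Python sketch for a single sequence. The helper `pad_post` is a hypothetical stand-in for illustration only; `pad_sequences` itself works on the whole batch and returns a NumPy array.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut the sequence at the end if too long, then pad at the end with value."""
    truncated = sequence[:maxlen]
    return truncated + [value] * (maxlen - len(truncated))

# A short sequence (train_sequences[4] from above) is padded with zeros at the end.
short = [715, 672, 1, 1, 1, 662, 553, 5, 4, 92, 1, 90]
padded = pad_post(short, maxlen=120)
print(len(padded), padded[:12], padded[12:15])

# A made-up 130-item sequence loses its last 10 items.
long_seq = list(range(1, 131))
clipped = pad_post(long_seq, maxlen=120)
print(len(clipped), clipped[-1])
```

Every row therefore ends up with exactly 120 entries, which is why `train_padded.shape` is `(20000, 120)`.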

In [25]:
# The model does not accept Python lists, so convert the labels to NumPy arrays.
train_labels = np.array(train_labels)
validation_labels = np.array(validation_labels)

In [26]:
embedding_dim=16

In [28]:
# When stacking LSTM layers, every LSTM layer except the last must set return_sequences=True.
model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_length),
    Bidirectional(LSTM(64, return_sequences=True)),
    Bidirectional(LSTM(64)),
    Dense(32, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 120, 16)           16000     
                                                                 
 bidirectional (Bidirectiona  (None, 120, 128)         41472     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 32)                4128      
                                                                 
 dense_1 (Dense)             (None, 16)                528       
                                                                 
 dense_2 (Dense)             (None, 1)                 1
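
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas for each layer type with biases enabled, and covers the layers whose counts appear in full above. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

counts = {
    'embedding': embedding_params(1000, 16),
    # Bidirectional wraps two LSTMs (forward and backward), doubling the count.
    'bidirectional': 2 * lstm_params(16, 64),
    # The second layer sees 128 features: 64 from each direction of the first.
    'bidirectional_1': 2 * lstm_params(128, 64),
    'dense': dense_params(128, 32),
    'dense_1': dense_params(32, 16),
}
for name, count in counts.items():
    print(name, count)
```

These values agree with the 16000, 41472, 98816, 4128, and 528 reported by `model.summary()`.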

In [29]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
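
As a rough illustration of what `loss='binary_crossentropy'` and `metrics=['acc']` measure, the sketch below computes both in plain Python. The labels are the first five from `labels[:5]` above; the predicted probabilities are made up for illustration, and the functions are simplified stand-ins rather than the Keras implementations.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if int(p > threshold) == t)
    return hits / len(y_true)

y_true = [0, 0, 1, 1, 0]            # labels[:5] from above
y_pred = [0.1, 0.4, 0.8, 0.3, 0.2]  # hypothetical sigmoid outputs

loss = binary_crossentropy(y_true, y_pred)
acc = binary_accuracy(y_true, y_pred)
print(round(loss, 4), acc)
```

The fourth example is sarcastic but receives a probability of 0.3, so it counts as a miss for accuracy and contributes the largest term to the loss.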

In [30]:
checkpoint_path = 'my_checkpoint.ckpt'
checkpoint = ModelCheckpoint(checkpoint_path, 
                             save_weights_only=True, 
                             save_best_only=True, 
                             monitor='val_loss',
                             verbose=1)

In [31]:
epochs=10

In [32]:
history = model.fit(train_padded, train_labels, 
                    validation_data=(validation_padded, validation_labels),
                    callbacks=[checkpoint],
                    epochs=epochs)

Epoch 1/10
Epoch 00001: val_loss improved from inf to 0.39465, saving model to my_checkpoint.ckpt
Epoch 2/10
Epoch 00002: val_loss improved from 0.39465 to 0.37381, saving model to my_checkpoint.ckpt
Epoch 3/10
Epoch 00003: val_loss did not improve from 0.37381
Epoch 4/10
Epoch 00004: val_loss did not improve from 0.37381
Epoch 5/10
Epoch 00005: val_loss improved from 0.37381 to 0.36781, saving model to my_checkpoint.ckpt
Epoch 6/10
Epoch 00006: val_loss did not improve from 0.36781
Epoch 7/10
Epoch 00007: val_loss did not improve from 0.36781
Epoch 8/10
Epoch 00008: val_loss did not improve from 0.36781
Epoch 9/10
Epoch 00009: val_loss did not improve from 0.36781
Epoch 10/10
Epoch 00010: val_loss did not improve from 0.36781
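
The log shows the checkpoint being written only at epochs 1, 2, and 5. With `save_best_only=True` and `monitor='val_loss'`, a save happens only when the monitored value drops below the best seen so far. A minimal sketch of that rule follows; `epochs_that_save` is a hypothetical helper, and the `val_loss` values for the non-improving epochs are placeholders because the log does not print them.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which val_loss improves, and the best value."""
    best = float('inf')
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            saved.append(epoch)
    return saved, best

# Epochs 1, 2, and 5 use the values from the log above.
# All other values are made-up placeholders that are merely worse than the best so far.
val_losses = [0.39465, 0.37381, 0.38, 0.39, 0.36781, 0.37, 0.38, 0.40, 0.42, 0.45]

saved, best = epochs_that_save(val_losses)
print(saved, best)  # [1, 2, 5] 0.36781
```

Because the weights on disk come from epoch 5, loading the checkpoint after training restores that best state rather than the weights from epoch 10.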


In [33]:
model.load_weights(checkpoint_path)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f8002a77fd0>