# Assignment 3 - Case 2: IMDB movie review positive/negative classification

- Accuracy and loss of the network trained for 6 epochs:
  - Training data, epoch 6:
    - Accuracy: 0.9568, Loss: 0.1233
  - Result (test data):
    - Accuracy: 0.8746, Loss: 0.3367

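- Interpretation: With more epochs the model made more passes over the training data, so training accuracy rose above the 4-epoch run. However, the model was trained without a validation set, so nothing signaled when it stopped generalizing, and test accuracy did not improve.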

- The code run for Case 2, classifying IMDB movie reviews, can be found below.


### IMDB data

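- The IMDB (Internet Movie Database) dataset used here contains 50,000 movie reviews written by ordinary viewers. Each review is plain text tagged as positive or negative. The goal is to build a model that estimates from the review text whether the rating is positive or negative.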


### 1. Loading the IMDB data

- The IMDB data is loaded from keras
- The program below limits the vocabulary used in the text to 10,000 words

In [1]:
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data( num_words=10000)

- keras counts word frequencies over the whole dataset and encodes each review as a list of integers using the 10,000 most frequent words
- The integer list alone does not reveal what the review says

In [2]:
print(train_data[0] )

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [3]:
# A label of 1 means a positive review; a label of 0 means a negative review
train_labels[0]

1

### 2. Viewing the IMDB text
- To read a review, decode it using the get_word_index function

In [4]:
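# Maps each word to its integer index
word_index = imdb.get_word_index()
# Invert the mapping: integer index -> word
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
# Indices are offset by 3 because 0, 1 and 2 are reserved (padding, start of sequence, unknown)
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in train_data[0]])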
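
As a small illustration of the decoding step above, the sketch below uses a hypothetical four-word vocabulary (`toy_word_index`) in place of the real `imdb.get_word_index()` result. The helper name `decode_review` and the toy data are assumptions made for this example; the offset of 3 mirrors the `i - 3` in the cell above.

```python
# Hypothetical miniature vocabulary; the real one comes from imdb.get_word_index().
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Invert the mapping: integer index -> word.
toy_reverse_index = {index: word for word, index in toy_word_index.items()}

def decode_review(encoded, reverse_index, offset=3):
    """Map encoded integers back to words; reserved or unknown indices become '?'."""
    return " ".join(reverse_index.get(i - offset, "?") for i in encoded)

# 1 marks the start of a review and 2 marks an out-of-vocabulary word,
# so both fall outside the vocabulary after subtracting the offset.
toy_encoded = [1, 4, 5, 6, 2, 7]
decoded_toy = decode_review(toy_encoded, toy_reverse_index)
print(decoded_toy)
```

This is why the decoded review shown next starts with `?` and has `?` wherever a word fell outside the 10,000-word vocabulary.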

- Contents of train_data[0]

In [5]:
decoded_review

"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you th

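### Network architecture
- The following network is used to decide from the text data whether a review is positive or negative
- Each text input is represented as a row of length 10,000 in a document-term matrix
  - Layer 1: 10,000 x 16 Dense (ReLU)
  - Layer 2: 16 x 16 Dense (ReLU)
  - Output layer: 16 x 1 (Sigmoid)
- Basic idea: judge whether a review is positive or negative from which words it uses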

![image.png](attachment:image.png)

### 3. Transforming the input data
- Each review is represented as a row of length 10,000 in a document-term matrix (DTM)
- The DTM is built with the vectorize_sequences function below

In [6]:
import numpy as np 
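# Input data: turn each integer sequence into a 0/1 vector of length `dimension`
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # mark every word index that occurs in review i
    return results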

x_train = vectorize_sequences(train_data) 
x_test = vectorize_sequences(test_data)
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')


- train_data contains 25,000 reviews, so x_train is a 25,000 x 10,000 matrix
- Each element of x_train holds 1 or 0 depending on whether the corresponding word appears in that review
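
To make the 0/1 encoding concrete, here is a minimal sketch that runs the same `vectorize_sequences` logic on two made-up reviews with a vocabulary of only 5 words. The names `toy_reviews` and `toy_matrix` and the tiny dimension are assumptions for illustration only.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per review, one column per word index.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set the columns of all words used in review i
    return results

# Two hypothetical reviews; word index 3 appears twice in the first one.
toy_reviews = [[1, 3, 3], [0, 2]]
toy_matrix = vectorize_sequences(toy_reviews, dimension=5)
print(toy_matrix)
```

A repeated word still yields a single 1, so the matrix records only whether a word was used, not how often or in what order.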


In [7]:
print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [8]:
x_train[0] 

array([0., 1., 1., ..., 0., 0., 0.])

### 4. Network architecture

In [9]:
from keras import models 
from keras import layers

model = models.Sequential() 
model.add(layers.Dense(16, activation='relu', input_shape=(10000,))) 
model.add(layers.Dense(16, activation='relu')) 
model.add(layers.Dense(1, activation='sigmoid'))
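
To show what these three layers compute, the following is a rough NumPy sketch of the forward pass with the same shapes. The weights are random placeholders rather than trained values, and the names `forward`, `W1`..`W3` and `x_batch` are assumptions for this example, not part of keras.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random placeholder weights with the same shapes as the three Dense layers.
W1, b1 = rng.normal(scale=0.01, size=(10000, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.01, size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(scale=0.01, size=(16, 1)), np.zeros(1)

def forward(x):
    h1 = relu(x @ W1 + b1)         # (batch, 10000) -> (batch, 16)
    h2 = relu(h1 @ W2 + b2)        # (batch, 16)    -> (batch, 16)
    return sigmoid(h2 @ W3 + b3)   # (batch, 16)    -> (batch, 1), value in (0, 1)

# A hypothetical batch of 4 reviews, each using word indices 1, 14 and 22.
x_batch = np.zeros((4, 10000))
x_batch[:, [1, 14, 22]] = 1.0
probs = forward(x_batch)
print(probs.shape)
```

The sigmoid output can be read as the estimated probability that a review is positive.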

### 5. Training setup

In [10]:
model.compile(optimizer='rmsprop', 
              loss='binary_crossentropy', 
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=6, batch_size=512)
# Evaluation
results = model.evaluate(x_test, y_test)

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
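
The interpretation at the top notes that this model was trained without a validation set. One possible way to hold out part of the training data is sketched below. The stand-in arrays `x_all` and `y_all`, the hold-out size `n_val = 10000`, and the names `partial_x_train` and `x_val` are all assumptions for illustration; the notebook itself does not do this.

```python
import numpy as np

# Stand-in arrays with the same number of rows as x_train / y_train (25,000);
# the feature width is shrunk to 8 to keep the sketch light.
x_all = np.zeros((25000, 8))
y_all = np.zeros(25000, dtype="float32")

n_val = 10000  # hypothetical hold-out size
x_val, partial_x_train = x_all[:n_val], x_all[n_val:]
y_val, partial_y_train = y_all[:n_val], y_all[n_val:]

print(partial_x_train.shape, x_val.shape)

# With the real arrays, the hold-out set could then be passed to keras, e.g.:
# history = model.fit(partial_x_train, partial_y_train, epochs=6, batch_size=512,
#                     validation_data=(x_val, y_val))
```

Watching validation loss per epoch would give a signal for when additional epochs stop helping on unseen data.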


### 6. Number of network parameters
- dense: 16 x 10,000 + 16 = 160,016
- dense_1: 16 x 16 + 16 = 272
- dense_2: 16 x 1 + 1 = 17
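
The arithmetic above can be checked with a few lines of Python. Each Dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is an assumption for this sketch.

```python
def dense_params(n_inputs, n_units):
    # weights (n_inputs x n_units) plus one bias per unit
    return n_inputs * n_units + n_units

# Input width followed by the unit count of each Dense layer.
layer_sizes = [10000, 16, 16, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(counts)
print(counts, total)
```

The total matches the `Total params` line reported by `model.summary()` below.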


In [11]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                160016    
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 160,305
Trainable params: 160,305
Non-trainable params: 0
_________________________________________________________________
