<img width="319" alt="tensor3" src="https://user-images.githubusercontent.com/74411831/115780666-63fffa80-a3f4-11eb-8db4-f53c5d877400.png">

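One-hot encoding: 1 at the word's index, 0 everywhere else (a sparse matrix)

+ High dimensionality (wastes resources; dimension = number of words in the vocabulary)
+ Cannot express similarity between words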
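
As a minimal sketch of the two drawbacks above, the snippet below one-hot encodes a hypothetical four-word vocabulary (the words are made up for illustration). Each vector is as long as the vocabulary, and the dot product between any two different words is always 0, so "movie" looks no closer to "film" than to "cat".

```python
# Hypothetical 4-word vocabulary, used only for illustration.
vocab = ["cat", "dog", "movie", "film"]

def one_hot(word):
    """Return a vector with 1 at the word's index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

movie, film, cat = one_hot("movie"), one_hot("film"), one_hot("cat")
print(movie)             # [0, 0, 1, 0]
print(dot(movie, film))  # 0
print(dot(movie, cat))   # 0
```

With a real vocabulary of 10,000 words, each vector would have 10,000 entries with a single 1, which is the waste the first bullet refers to.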

<img width="319" alt="tensor4" src="https://user-images.githubusercontent.com/74411831/115780668-65312780-a3f4-11eb-8223-0c46363e0b1f.png">

![tensor2](https://user-images.githubusercontent.com/74411831/115780661-62cecd80-a3f4-11eb-9b87-7965ffc79e6d.png)

In [1]:
import io
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

![tensor1](https://user-images.githubusercontent.com/74411831/115780651-5fd3dd00-a3f4-11eb-9f2a-cbf251ebe574.png)

In [2]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']

In [3]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['labeledBow.feat',
 'neg',
 'pos',
 'unsup',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

In [4]:
# Remove the unlabeled 'unsup' folder so that only the 'pos' and 'neg' class folders remain
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

In [5]:
batch_size = 1024 # batch size of 1024
seed = 123
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)

# 80% training / 20% validation

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [6]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    print(label_batch[i].numpy(), text_batch.numpy()[i])
    # Print 5 examples from the first batch (of 1024)

0 b"Oh My God! Please, for the love of all that is holy, Do Not Watch This Movie! It it 82 minutes of my life I will never get back. Sure, I could have stopped watching half way through. But I thought it might get better. It Didn't. Anyone who actually enjoyed this movie is one seriously sick and twisted individual. No wonder us Australians/New Zealanders have a terrible reputation when it comes to making movies. Everything about this movie is horrible, from the acting to the editing. I don't even normally write reviews on here, but in this case I'll make an exception. I only wish someone had of warned me before I hired this catastrophe"
1 b'This movie is SOOOO funny!!! The acting is WONDERFUL, the Ramones are sexy, the jokes are subtle, and the plot is just what every high schooler dreams of doing to his/her school. I absolutely loved the soundtrack as well as the carefully placed cynicism. If you like monty python, You will love this film. This movie is a tad bit "grease"esk (without

# **Embedding layer**

In [58]:
embedding_layer = tf.keras.layers.Embedding(1000, 5) # vocab size = 1000, embedding dimension = 5

**Example 1**

In [8]:
result = embedding_layer(tf.constant([1, 2, 3]))

In [9]:
result.shape

TensorShape([3, 5])

In [10]:
result.numpy()

array([[ 0.04911261, -0.01548098,  0.01584821, -0.0473286 ,  0.02925647],
       [-0.03795327,  0.04436849, -0.03938953, -0.02252357, -0.03000053],
       [ 0.0018615 , -0.03280344, -0.02480272, -0.04515153, -0.048002  ]],
      dtype=float32)

**Example 2**

In [11]:
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))

In [12]:
result.shape

TensorShape([2, 3, 5])

In [13]:
result.numpy()

array([[[-0.03089815, -0.04239283, -0.04515635,  0.00618646,
         -0.04136548],
        [ 0.04911261, -0.01548098,  0.01584821, -0.0473286 ,
          0.02925647],
        [-0.03795327,  0.04436849, -0.03938953, -0.02252357,
         -0.03000053]],

       [[ 0.0018615 , -0.03280344, -0.02480272, -0.04515153,
         -0.048002  ],
        [ 0.04804472, -0.00329263,  0.03378585, -0.02984965,
         -0.01211827],
        [ 0.02109848, -0.04386741, -0.01033193,  0.03365595,
         -0.0202814 ]]], dtype=float32)
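
The two examples above suggest that the embedding layer behaves like a lookup table: each integer index selects one row of a `(1000, 5)` weight matrix, so the output shape is the input shape with a trailing dimension of 5. Here is a NumPy sketch of that idea. It is not the Keras implementation; the name `table`, the random values, and the `[-0.05, 0.05]` range are assumptions chosen to resemble the outputs shown.

```python
import numpy as np

# Hypothetical stand-in for the layer's weights: 1000 rows (vocab), 5 columns (dimension).
rng = np.random.default_rng(0)
table = rng.uniform(-0.05, 0.05, size=(1000, 5)).astype(np.float32)

def embed(indices):
    """Look up one row of the table for every integer index."""
    return table[np.asarray(indices)]

print(embed([1, 2, 3]).shape)               # (3, 5)
print(embed([[0, 1, 2], [3, 4, 5]]).shape)  # (2, 3, 5)

# Index 1 maps to the same vector no matter where it appears in the input.
same_row = np.array_equal(embed([1, 2, 3])[0], embed([[0, 1, 2], [3, 4, 5]])[0, 1])
print(same_row)  # True
```

The notebook outputs show the same thing: the first row of Example 1 (index 1) is identical to the second row of the first sequence in Example 2.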

# **Building a classification model**

**Preprocessing**

In [14]:
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  # Replace <br /> tags with a space
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  # Remove punctuation (it is replaced with an empty string, not a space)
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')

vocab_size = 10000
sequence_length = 100

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

text_ds = train_ds.map(lambda x, y: x) # Drop the labels (0 = negative, 1 = positive); only the text is needed to build the vocabulary
vectorize_layer.adapt(text_ds)
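
To make the preprocessing above more concrete, here is a rough pure-Python sketch of what standardization, `adapt`, and integer vectorization do conceptually. It is not how `TextVectorization` is implemented: the helpers `standardize`, `build_vocab`, and `vectorize` and the sample sentences are hypothetical, and tie-breaking between equally frequent words may differ from TensorFlow. It does assume, as in `output_mode='int'`, that index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase, turn <br /> into a space, and strip punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def build_vocab(texts, max_tokens):
    """Keep the most frequent words; slots 0 and 1 are reserved."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    return ["", "[UNK]"] + [w for w, _ in counts.most_common(max_tokens - 2)]

def vectorize(text, vocab, sequence_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

texts = ["Great movie!<br />Great cast.", "Terrible movie."]
vocab = build_vocab(texts, max_tokens=10)
print(standardize(texts[0]))  # great movie great cast
print(vectorize("Great film, great cast", vocab, sequence_length=6))
```

In the second `print`, "film" is not in the vocabulary, so it maps to index 1, and the last two positions are zero padding.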

In [15]:
for x in text_ds.take(1):
    for i in range(5):
        print(x[i].numpy())

b"The original animated Dark Knight returns in this ace adventure movie that rivals Mask of Phantasm in its coolness. There's a lot of style and intelligence in Mystery of the Batwoman, so much more than Batman Forever or Batman and Robin.<br /><br />There's a new crime-fighter on the streets of Gotham. She dresses like a bat but she's not a grown-up Batgirl. And Batman is denying any affiliation with her. Meanwhile Bruce Wayne has to deal with the usual romances and detective work. But the Penguin, Bain and the local Mob makes things little more complicated.<br /><br />I didn't have high hopes for this 'un since being strongly let down but the weak Batman: Sub Zero (Robin isn't featured so much here!)but I was delighted with the imaginative and exciting set pieces, the clever plot and a cheeky sense of humor. This is definitely a movie no fan of Batman should be without. Keep your ears open for a really catchy song called 'Betcha Neva' which is featured prominently through-out.<br /><

In [16]:
print(vectorize_layer.get_vocabulary())



For an explanation of TextVectorization, see https://dodonam.tistory.com/188.

**base model**

In [17]:
embedding_dim=16

model = Sequential([
  vectorize_layer,
  Embedding(vocab_size, embedding_dim, name="embedding"),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(1)
])
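
In the model above, `GlobalAveragePooling1D` turns the variable content of each review into one fixed-size vector by averaging the token embeddings. Below is a NumPy sketch of that operation on a small made-up tensor; the shapes are scaled down from the `(batch, 100, 16)` used here, and the sketch ignores masking.

```python
import numpy as np

# Hypothetical batch: 2 reviews, 3 tokens each, 4-dimensional embeddings.
x = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# Average over the token axis (axis 1), leaving one vector per review.
pooled = x.mean(axis=1)

print(pooled.shape)  # (2, 4)
print(pooled[0])     # [4. 5. 6. 7.]
```

Because every position is averaged, word order is lost. Padding positions also contribute to the mean unless a mask is supplied.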

In [19]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
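
The last layer, `Dense(1)`, has no activation, so the model outputs a raw score (a logit) rather than a probability. That is why the loss is created with `from_logits=True`. The following is a plain-Python sketch of the computation for a single example, assuming the standard sigmoid and binary cross-entropy definitions. The function names are hypothetical, and TensorFlow uses a more numerically stable formulation internally.

```python
import math

def sigmoid(z):
    """Convert a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(y_true, logit):
    """Binary cross-entropy for one example, given a raw logit."""
    p = sigmoid(logit)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

print(sigmoid(0.0))                        # 0.5
print(round(bce_from_logit(1, 0.0), 4))    # 0.6931
print(bce_from_logit(1, 3.0) < bce_from_logit(1, -3.0))  # True
```

To turn the model's predictions into probabilities, apply a sigmoid to its outputs.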

In [20]:
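# Assumption: tensorboard_callback is not defined in any cell shown here,
# so a TensorBoard callback writing to a "logs" directory is assumed.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback])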

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x1cd864e6dc0>

In [21]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
text_vectorization (TextVect (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
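
The parameter counts in the summary can be checked by hand. This short calculation uses only the sizes from the model definition: `vocab_size = 10000`, `embedding_dim = 16`, a 16-unit hidden layer, and a 1-unit output. The vectorization and pooling layers have no weights.

```python
vocab_size, embedding_dim = 10000, 16

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# Dense(16): 16 inputs x 16 units, plus 16 biases.
dense_params = embedding_dim * 16 + 16
# Dense(1): 16 inputs x 1 unit, plus 1 bias.
output_params = 16 * 1 + 1

total = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params)  # 160000 272 17
print(total)                                          # 160289
```

Almost all of the parameters sit in the embedding table, so `vocab_size` and `embedding_dim` dominate the model size.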


**Trying to improve the base model's performance**

+ added a layer, added dropout, 30 epochs

In [49]:
embedding_dim=32

# `keras` is never imported on its own in this notebook, so use the tf.keras prefix
model = tf.keras.models.Sequential([
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])
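
The `Dropout(0.5)` layers in this model randomly zero half of their inputs during training and pass everything through unchanged at inference time. A NumPy sketch of the usual "inverted dropout" scheme follows. The `dropout` function is a hypothetical illustration, not the Keras layer, and it assumes that kept values are scaled by `1 / (1 - rate)` so their expected sum is preserved.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero each element with probability `rate`; rescale the survivors."""
    if not training:
        return x  # inference: inputs pass through unchanged
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((2, 8), dtype=np.float32)

out = dropout(x, 0.5, rng)
print(np.unique(out))                           # values are drawn from {0., 2.}
print(dropout(x, 0.5, rng, training=False) is x)  # True
```

With a rate of 0.5 after both small hidden layers, roughly half of the 32 and 16 activations are discarded at every training step.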

In [50]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [51]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x1cd88050fa0>

# **<font color=red>Failure</font>**

+dropout

In [52]:
embedding_dim=16

# `keras` is never imported on its own in this notebook, so use the tf.keras prefix
model = tf.keras.models.Sequential([
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

In [53]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [55]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x1cd8805ff40>

# **<font color=blue>Success</font>**