#### Setup

In [1]:
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import keras
import tensorflow as tf
import numpy as np
from keras import layers




#### Load the data: IMDB movie review sentiment classification
- Download the data and inspect its structure.

In [2]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz


In [4]:
!ls aclImdb
!ls aclImdb/test
!ls aclImdb/train

README
imdb.vocab
imdbEr.txt
test
train
labeledBow.feat
neg
pos
urls_neg.txt
urls_pos.txt
labeledBow.feat
neg
pos
unsup
unsupBow.feat
urls_neg.txt
urls_pos.txt
urls_unsup.txt


In [5]:
# The aclImdb/train/pos and aclImdb/train/neg folders contain text files, each holding one review (positive or negative):
!cat aclImdb/train/pos/6248_7.txt

Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the directors intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated uncensored swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than put in the right context.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive and live in a world of redundancy and boring horizons. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" 

In [6]:
# We only care about the pos and neg subfolders, so delete the other subfolder that contains text files:
!rm -r aclImdb/train/unsup

- The <span style="color: red;">keras.utils.text_dataset_from_directory</span> utility generates a labeled <span style="color: red;">tf.data.Dataset</span> object from a set of text files on disk that are filed into class-specific folders.
- Let's use it to generate the training, validation, and test datasets. The validation and training datasets are generated from two subsets of the <span style="color: red;">train</span> directory, with 20% of the samples going to the validation dataset and 80% to the training dataset.
  
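- Having a validation dataset in addition to the test dataset is useful for tuning hyperparameters, such as the model architecture, for which the test dataset should not be used.
- Before putting the model out into the real world, however, it should be retrained on all available training data (without creating a validation dataset) so that its performance is maximized.
- When using the <span style="color: red;">validation_split</span> and <span style="color: red;">subset</span> arguments, either specify a random seed or pass <span style="color: red;">shuffle=False</span>, so that the validation and training splits do not overlap.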
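
The last point can be illustrated with a small pure-Python sketch. It is a simplified stand-in for a seeded shuffle-then-slice split, not Keras's actual implementation, and `split_indices` is a hypothetical helper introduced only for this illustration. It shows why both calls need the same seed: each call reshuffles the file list on its own, so the two slices are complementary only when the shuffles match.

```python
import random


def split_indices(num_samples, validation_split, subset, seed):
    """Hypothetical helper: shuffle indices with a seed, then return one slice."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    num_val = int(validation_split * num_samples)
    if subset == "training":
        return indices[:-num_val]  # first 80% of the shuffled order
    return indices[-num_val:]  # last 20% of the shuffled order


# Same seed in both calls: the slices come from the same shuffled order.
train_idx = split_indices(25000, 0.2, "training", seed=1337)
val_idx = split_indices(25000, 0.2, "validation", seed=1337)
same_seed_overlap = len(set(train_idx) & set(val_idx))
print(len(train_idx), len(val_idx), same_seed_overlap)

# Different seeds: the slices come from different shuffles and can overlap.
mismatched_val_idx = split_indices(25000, 0.2, "validation", seed=42)
mismatched_overlap = len(set(train_idx) & set(mismatched_val_idx))
print(mismatched_overlap)
```

With matching seeds the overlap is zero. With mismatched seeds, some validation samples would also appear in the training subset, which makes validation metrics look better than they should.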

In [8]:
batch_size = 32
raw_train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=1337,
)
raw_val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337,
)
raw_test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test",
    batch_size=batch_size,
)

print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
Number of batches in raw_train_ds: 625
Number of batches in raw_val_ds: 157
Number of batches in raw_test_ds: 782
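
The batch counts above are consistent with ceiling division of the file counts by `batch_size`, which means the final, partially filled batch is kept. A quick arithmetic check under that assumption (the dictionary below just restates the file counts from the output):

```python
import math

batch_size = 32
# File counts taken from the "Using ... files" / "Found ... files" lines above.
files_per_split = {"raw_train_ds": 20000, "raw_val_ds": 5000, "raw_test_ds": 25000}

# The last batch may hold fewer than batch_size samples, so round up.
batches = {name: math.ceil(n / batch_size) for name, n in files_per_split.items()}
for name, count in batches.items():
    print(f"Number of batches in {name}: {count}")
```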


- Let's look at a few samples.
    - Checking the raw data is important for confirming that normalization and tokenization work as expected.

In [11]:
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(10):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])

b'I rented this horrible movie. The worst think I have ever seen. I believe a 1st grade class could have done a better job. The worse film I have ever seen and I have seen some bad ones. Nothing scary except I paid 1.50 to rent it and that was 1.49 too much. The acting is horrible, the characters are worse and the film is just a piece of trash. The slauther house scenes are so low budget that it makes a B movied look like an Oscar candidate. All I can say is if you wnat to waste a good evening and a little money go rent this horrible flick. I would rather watch killer clowns from outer space while sitting in a bucket of razors than sit through this flop again'
0
b"I spent almost two hours watching a movie that I thought, with all the good actors in it, would be worth watching. I couldn't believe it when the movie ended and I had absolutely no idea what had happened.....I was mad because I could have used that time doing something else....I tried to figure it all out, but really had no 
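
Each sample above prints as a Python `bytes` object followed by an integer label. The sketch below shows one way to make such a pair easier to read. It assumes the labels are inferred from the subfolder names in sorted order, so `neg` maps to 0 and `pos` to 1. `describe_sample` is a hypothetical helper, and the sample text is the opening of the first review printed above.

```python
# Assumption: class indices follow the alphabetical order of the subfolder names.
class_names = sorted(["pos", "neg"])  # ['neg', 'pos']


def describe_sample(raw_text: bytes, label: int, max_chars: int = 60) -> str:
    """Hypothetical helper: decode the bytes and prefix the class name."""
    text = raw_text.decode("utf-8")
    return f"[{class_names[label]}] {text[:max_chars]}"


sample_text = b"I rented this horrible movie. The worst think I have ever seen."
summary = describe_sample(sample_text, 0)
print(summary)
```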

#### Prepare the data
- Remove the <span style="color: red;">&lt;br /&gt;</span> tags.