<a href="https://colab.research.google.com/github/JUN0-LEE/mini-MLpiscine/blob/master/(Keras)intro_to_sparse_data_and_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2017 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Intro to Sparse Data and Embeddings
**Learning Objectives**
- Convert movie-review data to a sparse feature vector
- Implement a sentiment-analysis DNN model using a sparse feature vector
- Implement a sentiment-analysis model using an embedding

## Setup
Let's import our dependencies and download the training and test data.

In [1]:
# install TensorFlow 2.0
!pip install tensorflow==2.0.0-beta1

Collecting tensorflow==2.0.0-beta1
  Downloading https://files.pythonhosted.org/packages/5d/e8/95d05e5518e7bc671f918d1a6813e532cde4af232ccd40f9bb5586c000b0/tensorflow-2.0.0b1-cp27-cp27mu-manylinux1_x86_64.whl (87.9MB)
Collecting tf-estimator-nightly<1.14.0.dev2019060502,>=1.14.0.dev2019060501 (from tensorflow==2.0.0-beta1)
  Downloading https://files.pythonhosted.org/packages/32/dd/99c47dd007dcf10d63fd895611b063732646f23059c618a373e85019eb0e/tf_estimator_nightly-1.14.0.dev2019060501-py2.py3-none-any.whl (496kB)
Collecting tb-nightly<1.14.0a20190604,>=1.14.0a20190603 (from tensorflow==2.0.0-beta1)
  Downloading https://files.pythonhosted.org/packages/c3/df/f15af3319c0094c0c74ca291f10d7b1235196988ab67c11bc09950bb7b07/tb_nightly-1.14.0a20190603-py2-none-any.whl (3.1MB)
Installing collected package

In [2]:
from __future__ import print_function

import numpy as np
import tensorflow as tf

print(tf.__version__)

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=10000)

2.0.0-beta1
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


## Explore the data
Let's explore the format of the dataset before training the model. 

In [3]:
x_train

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1153, 194, 8255, 78, 228,

In [4]:
x_train[0]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25,
 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385,
 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22,
 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17,
 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16,
 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33,
 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256,
 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13,
 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21,
 134, 476, 26, 480, 5, 144, 30, 5535, 18,

In [5]:
y_train[0]

1
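
Each review is a list of integer word indices, and the label is 1 for a positive review and 0 for a negative one. As a rough sketch of how the indices map back to text, the helper below assumes the default `load_data` encoding, where indices 0, 1 and 2 are reserved (padding, start of review, out-of-vocabulary) and real words are shifted up by 3. `decode_review` and `toy_word_index` are hypothetical names introduced for illustration; the full vocabulary would come from `tf.keras.datasets.imdb.get_word_index()`.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map Keras IMDB word indices back to words.

    Assumes the default load_data() encoding: indices below index_from
    are reserved, so they are shown as '?'.
    """
    reverse_index = {rank + index_from: word for word, rank in word_index.items()}
    return ' '.join(reverse_index.get(i, '?') for i in encoded)


# Hypothetical toy vocabulary, for illustration only. The real mapping
# comes from tf.keras.datasets.imdb.get_word_index().
toy_word_index = {'this': 11, 'was': 13, 'film': 19}

decoded = decode_review([1, 14, 22, 16, 2], toy_word_index)
print(decoded)
```

With the real word index, the same helper could be applied to `x_train[0]` before it is vectorized.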

## Preparing the data


The problem with our data is that each review is a variable-length list of integers, not a tensor. We are going to convert the reviews into fixed-size tensors using multi-hot encoding.

In [0]:
def vectorize_sequences(sequences, dimension=10000):
  results = np.zeros((len(sequences), dimension))
  for i, sequence in enumerate(sequences):
    results[i, sequence] = 1
  return results

In [0]:
# Transform the training and test reviews into multi-hot encoded vectors
x_train = vectorize_sequences(x_train)
x_test = vectorize_sequences(x_test)

# Convert the training and test labels to float32 arrays
y_train = np.asarray(y_train).astype('float32')
y_test = np.asarray(y_test).astype('float32')
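
To make the encoding concrete, here is a small sketch that runs the same helper on two made-up reviews over a six-word vocabulary. The toy inputs and the `density` calculation are assumptions added for illustration. Note that repeated words collapse to a single 1 and word order is lost.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    # Same logic as the notebook's helper: one row per review,
    # one column per word index, 1 where the word occurs.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1
    return results


# Two toy "reviews" over a 6-word vocabulary (illustrative values).
toy_reviews = [[1, 3, 3, 5], [0, 2]]
encoded = vectorize_sequences(toy_reviews, dimension=6)
print(encoded)

# Fraction of non-zero entries; a low value means the matrix is sparse.
density = encoded.sum() / encoded.size
print(density)
```

On the real data, a review of a few hundred words sets at most a few hundred of the 10,000 columns, which is why the feature vector is called sparse.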

## Let's build and train the network
We are going to build a simple stack of Dense layers with 'relu' activations. The final layer uses a sigmoid activation, so the model outputs a probability between 0 and 1 that we use to classify a review as 1 (positive) or 0 (negative).

In [0]:
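def train_dnn_model(
    learning_rate,
    epochs,
    batch_size,
    hidden_units,
    training_examples,
    training_targets,
    callbacks):
  """Builds and trains a stack of Dense layers on multi-hot encoded reviews."""

  model = tf.keras.models.Sequential()

  # One ReLU-activated Dense layer per entry in hidden_units.
  for unit in hidden_units:
    model.add(tf.keras.layers.Dense(unit, activation='relu'))

  # A single sigmoid unit outputs the probability of a positive review.
  model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

  model.compile(
      optimizer=tf.keras.optimizers.RMSprop(learning_rate=learning_rate, clipnorm=5.),
      loss=tf.keras.losses.binary_crossentropy,
      metrics=[tf.keras.metrics.binary_accuracy])

  # Forward the callbacks (e.g. TensorBoard) to fit().
  history = model.fit(training_examples,
                      training_targets,
                      epochs=epochs,
                      batch_size=batch_size,
                      callbacks=callbacks)

  return model, history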
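
As a rough illustration of what this stack computes at prediction time, the sketch below writes the forward pass in plain NumPy. The layer sizes, the random untrained weights, and the helper names `relu`, `sigmoid` and `forward` are assumptions made for the example; Keras learns the real weights during `fit`.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, weights, biases):
    """Forward pass: ReLU on every hidden layer, sigmoid on the last one."""
    activation = x
    for w, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation.dot(w) + b)
    return sigmoid(activation.dot(weights[-1]) + biases[-1])


rng = np.random.RandomState(0)
# Toy sizes; the notebook's model is 10000 -> 64 -> 64 -> 1.
layer_sizes = [20, 8, 8, 1]
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

# Four toy multi-hot rows standing in for encoded reviews.
x = rng.randint(0, 2, size=(4, 20)).astype('float32')
probabilities = forward(x, weights, biases)
print(probabilities.shape)
```

Each row of `probabilities` is a single value strictly between 0 and 1, which is what makes the sigmoid output usable as a positive-review probability.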

In [9]:
model, history = train_dnn_model(
    learning_rate=0.001,
    epochs=5,
    batch_size=100,
    hidden_units=[64,64],
    training_examples=x_train,
    training_targets=y_train,
    callbacks=[tf.keras.callbacks.TensorBoard(log_dir='intro_to_sparse_data')])

W0628 23:56:47.830127 140105280366464 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_grad.py:1250: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Train on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Evaluate your model

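Let's evaluate the model on the test set. `evaluate` returns the loss followed by the metrics given to `compile` (here, binary accuracy).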

In [10]:
model.evaluate(x_test, y_test)



[0.5666284610438347, 0.8674]
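
The two numbers are the test loss (binary cross-entropy) and the test accuracy. A minimal NumPy sketch of how these two quantities are defined, using made-up labels and predictions, might look like the following. The clipping constant and the 0.5 threshold are assumptions chosen to mirror common Keras defaults.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0); eps is an assumed value.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_example = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(per_example))


def binary_accuracy(y_true, y_prob, threshold=0.5):
    # A probability above the threshold counts as a positive prediction.
    predictions = (y_prob > threshold).astype('float64')
    return float(np.mean(predictions == y_true))


y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_prob = np.array([0.9, 0.2, 0.4, 0.1])  # made-up model outputs

loss = binary_crossentropy(y_true, y_prob)
accuracy = binary_accuracy(y_true, y_prob)
print(loss, accuracy)
```

The third example is misclassified (0.4 is below the threshold), so the accuracy is 0.75, and that same example contributes the largest term to the loss.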

# Intro to embedding
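Now let's try an embedding layer. Instead of multi-hot vectors, this model takes each review as a fixed-length sequence of 100 word indices and learns a dense 64-dimensional vector for each of the 10,000 words.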

In [0]:
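# Reload the raw integer sequences (x_train and x_test were overwritten by the multi-hot vectors)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=10000)

# Pad or truncate every review to exactly 100 word indices
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=100)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=100)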
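
With its default arguments, `pad_sequences` pads short sequences with zeros on the left and, for long sequences, keeps only the last `maxlen` items. The NumPy sketch below imitates that default behavior on toy input. `pad_sequences_sketch` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np


def pad_sequences_sketch(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype='int32')
    for i, sequence in enumerate(sequences):
        kept = sequence[-maxlen:]               # drop items from the front
        padded[i, maxlen - len(kept):] = kept   # right-align, zeros on the left
    return padded


padded = pad_sequences_sketch([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4)
print(padded)
```

Here the first toy sequence gains one leading zero and the second loses its first two items, so both rows end up with exactly four entries.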

In [0]:
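def embedding_dnn_model(
    learning_rate,
    epochs,
    batch_size,
    hidden_units,
    training_examples,
    training_targets,
    callbacks):
  """Builds and trains an Embedding + Dense model on padded review sequences."""

  model = tf.keras.models.Sequential()

  # Map each of the 10,000 word indices to a 64-dimensional vector,
  # then flatten the (100, 64) output into one 6,400-dimensional vector.
  model.add(tf.keras.layers.Embedding(10000, 64, input_length=100))
  model.add(tf.keras.layers.Flatten())

  for unit in hidden_units:
    model.add(tf.keras.layers.Dense(unit, activation='relu'))

  # A single sigmoid unit outputs the probability of a positive review.
  model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

  model.compile(
      optimizer=tf.keras.optimizers.RMSprop(learning_rate=learning_rate, clipnorm=5.),
      loss=tf.keras.losses.binary_crossentropy,
      metrics=['acc'])

  # Forward the callbacks (e.g. TensorBoard) to fit().
  history = model.fit(training_examples,
                      training_targets,
                      epochs=epochs,
                      batch_size=batch_size,
                      callbacks=callbacks)

  return model, history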
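
Conceptually, an embedding layer is a lookup table: each word index selects one row of a weight matrix. The NumPy sketch below illustrates the lookup and the flatten step with toy sizes and random weights (all values here are assumptions for the example). It also checks that the lookup gives the same result as multiplying one-hot vectors by the table.

```python
import numpy as np

# Toy sizes; the notebook's model uses 10000, 64 and 100.
vocab_size, embedding_dim, sequence_length = 10, 4, 3

rng = np.random.RandomState(0)
# Random stand-in for the weights that Keras would learn during training.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Two padded toy reviews, each a sequence of word indices.
batch = np.array([[0, 7, 2], [5, 5, 9]])

embedded = embedding_table[batch]             # one row per index: shape (2, 3, 4)
flattened = embedded.reshape(len(batch), -1)  # what Flatten produces: shape (2, 12)

# The same result via one-hot vectors and a matrix product.
one_hot = np.eye(vocab_size)[batch]           # shape (2, 3, 10)
via_matmul = one_hot.dot(embedding_table)     # shape (2, 3, 4)

print(embedded.shape, flattened.shape, np.allclose(embedded, via_matmul))
```

The lookup avoids ever materializing the one-hot vectors, which is the practical advantage of an embedding over a dense layer applied to sparse input.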

In [15]:
model, history = embedding_dnn_model(
    learning_rate=0.001,
    epochs=5,
    batch_size=100,
    hidden_units=[64],
    training_examples=x_train,
    training_targets=y_train,
    callbacks=[tf.keras.callbacks.TensorBoard(log_dir='intro_to_sparse_data')])

Train on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Evaluate your model
Now let's evaluate the embedding model on the same test set.

In [0]:
model.evaluate(x_test, y_test)



[0.8640255265569687, 0.82656]