<a href="https://colab.research.google.com/github/Anjasfedo/Learning-TensorFlow/blob/main/eat_tensorflow2_in_30_days/Chapter1_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1-3 Example: Modeling Procedure for Texts

## 1. Data Preparation

The goal of the IMDB dataset is to predict a sentiment label from the text of a movie review.

There are 20,000 text reviews in the training dataset and 5,000 in the test dataset; in each, half are positive and half are negative.

Pre-processing a text dataset is fairly complex: it includes word segmentation (needed for Chinese only, so not relevant to this demo), dictionary construction, encoding, sequence padding, and data pipeline construction.
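
The dictionary construction, encoding, and sequence padding steps can be sketched in plain Python. This is only an illustration of the idea, not what TensorFlow does internally; `build_vocab`, `encode`, and the toy texts are hypothetical names and data introduced for this sketch. Index 0 is assumed to mean padding and index 1 an unknown word.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Map the most frequent words to integer ids; 0 = padding, 1 = unknown."""
    counts = Counter(word for text in texts for word in text.split())
    # Two ids are reserved, so only max_tokens - 2 real words fit.
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: idx for idx, word in enumerate(most_common, start=2)}

def encode(text, vocab, max_len):
    """Encode one text as exactly max_len ids, truncating or zero-padding."""
    ids = [vocab.get(word, 1) for word in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))

toy_texts = ["good movie", "bad movie", "good good film"]  # hypothetical data
toy_vocab = build_vocab(toy_texts, max_tokens=5)
encoded = encode("good movie tonight", toy_vocab, max_len=5)
print(toy_vocab)
print(encoded)
```

Here "tonight" never appears in the toy texts, so it is encoded as the unknown id, and the sequence is zero-padded to the requested length.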

There are two popular methods of text preparation in TensorFlow:
1. Construct a text data generator using `Tokenizer` in `tf.keras.preprocessing`, together with `tf.keras.utils.Sequence`.
2. Use `tf.data.Dataset` together with the pre-processing layer `tf.keras.layers.experimental.preprocessing.TextVectorization`.

This example uses the second method.

In [41]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as py
import tensorflow as tf
from tensorflow.keras import models, layers, preprocessing, optimizers, losses, metrics
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import re, string, os

In [42]:
base_url = "https://raw.githubusercontent.com/lyhue1991/eat_tensorflow2_in_30_days/master/data/imdb/"

train_filename = "train.csv"
test_filename = "test.csv"

train_url = base_url + train_filename
test_url = base_url + test_filename

train_data_path = tf.keras.utils.get_file(train_filename, origin=train_url, cache_dir='.', cache_subdir='data')
test_data_path = tf.keras.utils.get_file(test_filename, origin=test_url, cache_dir='.', cache_subdir='data')

print(f"Train data downloaded to: {train_data_path}")
print(f"Test data downloaded to: {test_data_path}")

print(f"Train data exists: {os.path.exists(train_data_path)}, Size: {os.path.getsize(train_data_path) / 1024:.2f} KB")
print(f"Test data exists: {os.path.exists(test_data_path)}, Size: {os.path.getsize(test_data_path) / 1024:.2f} KB")

Downloading data from https://raw.githubusercontent.com/lyhue1991/eat_tensorflow2_in_30_days/master/data/imdb/train.csv
Downloading data from https://raw.githubusercontent.com/lyhue1991/eat_tensorflow2_in_30_days/master/data/imdb/test.csv
Train data downloaded to: ./data/train.csv
Test data downloaded to: ./data/test.csv
Train data exists: True, Size: 26058.23 KB
Test data exists: True, Size: 6482.65 KB


In [57]:
MAX_WORDS = 10000 # keep only the 10,000 most frequent words
MAX_LEN = 200 # for each sample, keep the first 200 words
BATCH_SIZE = 32

In [51]:
# Construct data pipeline
def split_line(line):
  arr = tf.strings.split(line, sep='\t')
  label = tf.expand_dims(tf.cast(tf.strings.to_number(arr[0]), tf.int32), axis=0)
  text = tf.expand_dims(arr[1], axis=0)
  return (text, label)
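
As a rough plain-Python counterpart of `split_line`, the sketch below parses one `label<TAB>text` line the same way: the first field becomes an integer label and the second field the review text. `split_line_py` and the sample line are hypothetical and exist only for illustration; the TensorFlow version additionally wraps each value in a length-1 axis with `tf.expand_dims`.

```python
def split_line_py(line):
    """Split one 'label<TAB>text' line into (text, label)."""
    fields = line.split("\t")
    # Mirror tf.strings.to_number followed by a cast to int32.
    label = int(float(fields[0]))
    text = fields[1]
    return text, label

sample_line = "1\tA surprisingly moving film."  # hypothetical sample line
text, label = split_line_py(sample_line)
print(label, text)
```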

In [52]:
ds_train_raw = tf.data.TextLineDataset(filenames=[train_data_path]) \
                .map(split_line, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
                .shuffle(buffer_size=10000) \
                .batch(BATCH_SIZE) \
                .prefetch(tf.data.experimental.AUTOTUNE)

ds_test_raw = tf.data.TextLineDataset(filenames=[test_data_path]) \
                .map(split_line, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
                .batch(BATCH_SIZE) \
                .prefetch(tf.data.experimental.AUTOTUNE)
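
To get a feel for what `.batch(BATCH_SIZE)` produces, the following plain-Python sketch groups items into batches and counts them. It assumes the sample counts quoted earlier (20,000 training and 5,000 test reviews); `batched` is a hypothetical helper, not a TensorFlow function.

```python
import math
from itertools import islice

def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

BATCH_SIZE = 32
train_batches = math.ceil(20000 / BATCH_SIZE)
test_batch_sizes = [len(b) for b in batched(range(5000), BATCH_SIZE)]
print(train_batches, len(test_batch_sizes), test_batch_sizes[-1])
```

Under those assumptions the last test batch is smaller than the others, which is why the model has to accept a variable batch dimension.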

In [53]:
# Construct dictionary
def clean_text(text):
  lowercase = tf.strings.lower(text)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  cleaned_punctuation = tf.strings.regex_replace(stripped_html, '[%s]' % re.escape(string.punctuation), '')
  return cleaned_punctuation
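
The same standardization can be tried on ordinary Python strings, which makes it easy to check what `clean_text` does to a review before running the full pipeline. `clean_text_py` is a hypothetical stand-in written with the standard `re` module, and the sample sentence is made up for illustration.

```python
import re
import string

def clean_text_py(text):
    """Lowercase, replace '<br />' with a space, then drop punctuation."""
    lowercase = text.lower()
    # The tag must be handled first, because '<', '/' and '>' are punctuation too.
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)

cleaned = clean_text_py("A great film!<br />Loved it.")
print(cleaned.split())
```

Because apostrophes are removed as punctuation, a word such as "don't" becomes "dont", which matches the `dont` entry in the vocabulary printed further down.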

In [54]:
vectorize_layer = TextVectorization(
    standardize=clean_text,
    split='whitespace',
    max_tokens=MAX_WORDS,
    output_mode='int',
    output_sequence_length=MAX_LEN
)

ds_text = ds_train_raw.map(lambda text, label: text)
vectorize_layer.adapt(ds_text)
print(vectorize_layer.get_vocabulary()[0:100])

['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'on', 'not', 'you', 'his', 'are', 'have', 'be', 'he', 'one', 'its', 'at', 'all', 'by', 'an', 'they', 'from', 'who', 'so', 'like', 'her', 'just', 'or', 'about', 'has', 'if', 'out', 'some', 'there', 'what', 'good', 'more', 'when', 'very', 'she', 'even', 'my', 'no', 'would', 'up', 'time', 'only', 'which', 'story', 'really', 'their', 'were', 'had', 'see', 'can', 'me', 'than', 'we', 'much', 'well', 'get', 'been', 'will', 'into', 'people', 'also', 'other', 'do', 'bad', 'because', 'great', 'first', 'how', 'him', 'most', 'dont', 'made', 'then', 'them', 'films', 'movies', 'way', 'make', 'could', 'too', 'any']


In [56]:
# Word encoding
ds_train = ds_train_raw.map(lambda text, label: (vectorize_layer(text), label)) \
            .prefetch(tf.data.experimental.AUTOTUNE)
ds_test = ds_test_raw.map(lambda text, label: (vectorize_layer(text), label)) \
            .prefetch(tf.data.experimental.AUTOTUNE)

## 2. Model Definition

The following shows how to define a customized model by subclassing the base class `Model`.

In [107]:
# Note: in practice, building models with Sequential() or the functional API should be prioritized.

tf.keras.backend.clear_session()

class CnnModel(models.Model):
  def __init__(self):
    super(CnnModel, self).__init__()

  def build(self, input_shape):
    self.embedding = layers.Embedding(MAX_WORDS, 7, input_length=MAX_LEN)
    self.conv_1 = layers.Conv1D(16, kernel_size=5, name='conv_1', activation='relu')
    self.pool_1 = layers.MaxPool1D(name='pool_1')
    self.conv_2 = layers.Conv1D(128, kernel_size=2, name='conv_2', activation='relu')
    self.pool_2 = layers.MaxPool1D(name='pool_2')
    self.flatten = layers.Flatten()
    self.dense = layers.Dense(1, activation='sigmoid')
    super(CnnModel,self).build(input_shape)

  def call(self, inputs):
    x = self.embedding(inputs)
    x = self.conv_1(x)
    x = self.pool_1(x)
    x = self.conv_2(x)
    x = self.pool_2(x)
    x = self.flatten(x)
    x = self.dense(x)
    return (x)

  def summary(self):
    x_input = layers.Input(shape=(MAX_LEN,), dtype=tf.int32)
    output = self.call(x_input)
    model = models.Model(inputs=x_input, outputs=output)
    model.summary()

model = CnnModel()
model.build(input_shape=(None, MAX_LEN))
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 200)]             0         
                                                                 
 embedding (Embedding)       (None, 200, 7)            70000     
                                                                 
 conv_1 (Conv1D)             (None, 196, 16)           576       
                                                                 
 pool_1 (MaxPooling1D)       (None, 98, 16)            0         
                                                                 
 conv_2 (Conv1D)             (None, 97, 128)           4224      
                                                                 
 pool_2 (MaxPooling1D)       (None, 48, 128)           0         
                                                                 
 flatten (Flatten)           (None, 6144)              0     
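
The shapes and parameter counts in the summary above can be reproduced with a few lines of arithmetic. The sketch assumes the Keras defaults used in the model: `'valid'` padding with stride 1 for `Conv1D`, and a pool size of 2 for `MaxPool1D`. The helper functions are hypothetical and written only for this check.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1

def pool1d_out_len(length, pool_size=2):
    """Output length of non-overlapping max pooling without padding."""
    return length // pool_size

def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

MAX_WORDS, MAX_LEN, EMBED_DIM = 10000, 200, 7

embedding_params = MAX_WORDS * EMBED_DIM
len_conv_1 = conv1d_out_len(MAX_LEN, 5)
len_pool_1 = pool1d_out_len(len_conv_1)
len_conv_2 = conv1d_out_len(len_pool_1, 2)
len_pool_2 = pool1d_out_len(len_conv_2)
flatten_units = len_pool_2 * 128

print(embedding_params, conv1d_params(5, EMBED_DIM, 16), conv1d_params(2, 16, 128))
print(len_conv_1, len_pool_1, len_conv_2, len_pool_2, flatten_units)
```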

## 3. Model Training

Here we use a customized training loop.

In [108]:
# Time stamp
@tf.function
def printbar():
  ts = tf.timestamp()
  today_ts = tf.timestamp()%(24*60*60)

  hour = tf.cast(tf.floor(today_ts/3600), tf.int32)
  minute = tf.cast(tf.floor((today_ts%3600)/60), tf.int32)
  second = tf.cast(tf.floor(today_ts%60), tf.int32)

  def timeformat(m):
    if tf.strings.length(tf.strings.format('{}', m)) == 1:
      return tf.strings.format('0{}', m)
    else:
      return tf.strings.format('{}', m)

  timestring = tf.strings.join([timeformat(hour), timeformat(minute), timeformat(second)], separator=':')

  tf.print('========'*8 + '\n' + timestring)

In [109]:
optimizer = optimizers.Nadam()
loss_fn = losses.BinaryCrossentropy()
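
For intuition about the loss, binary cross-entropy averages `-(y*log(p) + (1-y)*log(1-p))` over the samples. The plain-Python sketch below computes it for two made-up predictions. The clipping constant `eps=1e-7` is an assumption chosen to avoid `log(0)`; it is not taken from this notebook.

```python
import math

def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy for probabilities in [0, 1]."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Hypothetical batch: one positive and one negative review.
loss = binary_crossentropy([1, 0], [0.9, 0.2])
print(round(loss, 4))
```

Confident but wrong predictions are penalized heavily, which is consistent with the validation loss growing in the later epochs below while validation accuracy changes only slightly.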

In [110]:
train_loss = metrics.Mean(name='train_loss')
train_metric = metrics.BinaryAccuracy(name='train_accuracy')

valid_loss = metrics.Mean(name='valid_loss')
valid_metric = metrics.BinaryAccuracy(name='valid_accuracy')
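
These metric objects are stateful: each `update_state` call accumulates across batches, `result()` reads the running value, and `reset_states()` clears it for the next epoch. A minimal plain-Python sketch of that pattern follows. The two classes are hypothetical stand-ins, and the 0.5 decision threshold is assumed.

```python
class RunningMean:
    """Average of all values seen since the last reset."""
    def __init__(self):
        self.reset_states()

    def reset_states(self):
        self.total, self.count = 0.0, 0

    def update_state(self, value):
        self.total += value
        self.count += 1

    def result(self):
        return self.total / self.count if self.count else 0.0

class RunningBinaryAccuracy:
    """Fraction of thresholded predictions that match the labels."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.reset_states()

    def reset_states(self):
        self.correct, self.seen = 0, 0

    def update_state(self, labels, probs):
        for y, p in zip(labels, probs):
            self.correct += int((p > self.threshold) == bool(y))
            self.seen += 1

    def result(self):
        return self.correct / self.seen if self.seen else 0.0

mean_loss, accuracy = RunningMean(), RunningBinaryAccuracy()
# Two hypothetical batches: (batch loss, labels, predicted probabilities).
for batch_loss, labels, probs in [(0.2, [1, 0], [0.9, 0.2]), (0.6, [1, 1], [0.4, 0.7])]:
    mean_loss.update_state(batch_loss)
    accuracy.update_state(labels, probs)
print(mean_loss.result(), accuracy.result())
```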

In [111]:
@tf.function
def train_step(model, features, labels):
  with tf.GradientTape() as tape:
    predictions = model(features, training=True)
    loss = loss_fn(labels, predictions)

  gradients = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))

  train_loss.update_state(loss)
  train_metric.update_state(labels, predictions)

@tf.function
def valid_step(model, features, labels):
  predictions = model(features, training=False)
  batch_loss = loss_fn(labels, predictions)

  valid_loss.update_state(batch_loss)
  valid_metric.update_state(labels, predictions)

In [112]:
def train_model(model, ds_train, ds_valid, epochs):
  for epoch in tf.range(1, epochs + 1):

    for features, labels in ds_train:
      train_step(model, features, labels)

    for features, labels in ds_valid:
      valid_step(model, features, labels)

    # Log the aggregated metrics (the interval is 1, so every epoch is printed).
    if epoch % 1 == 0:
      print("=" * 50)
      print(f"Epoch {epoch}/{epochs}, "
            f"Loss: {train_loss.result():.4f}, "
            f"Accuracy: {train_metric.result():.4f}, "
            f"Valid Loss: {valid_loss.result():.4f}, "
            f"Valid Accuracy: {valid_metric.result():.4f}")
      print("=" * 50)

    train_loss.reset_states()
    train_metric.reset_states()
    valid_loss.reset_states()
    valid_metric.reset_states()

train_model(model, ds_train, ds_test, epochs=10)

Epoch 1/10, Loss: 0.4567, Accuracy: 0.7563, Valid Loss: 0.3198, Valid Accuracy: 0.8678
Epoch 2/10, Loss: 0.2336, Accuracy: 0.9073, Valid Loss: 0.3339, Valid Accuracy: 0.8710
Epoch 3/10, Loss: 0.1561, Accuracy: 0.9406, Valid Loss: 0.3916, Valid Accuracy: 0.8648
Epoch 4/10, Loss: 0.0936, Accuracy: 0.9677, Valid Loss: 0.5181, Valid Accuracy: 0.8554
Epoch 5/10, Loss: 0.0452, Accuracy: 0.9850, Valid Loss: 0.6891, Valid Accuracy: 0.8530
Epoch 6/10, Loss: 0.0199, Accuracy: 0.9944, Valid Loss: 0.9018, Valid Accuracy: 0.8494
Epoch 7/10, Loss: 0.0111, Accuracy: 0.9967, Valid Loss: 1.0939, Valid Accuracy: 0.8498
Epoch 8/10, Loss: 0.0102, Accuracy: 0.9966, Valid Loss: 1.2206, Valid Accuracy: 0.8508
Epoch 9/10, Loss: 0.0131, Accuracy: 0.9954, Valid Loss: 1.3358, Valid Accuracy: 0.8442
Epoch 10/10, Loss: 0.0216, Accuracy: 0.9921, Valid Loss: 1.3089, Valid Accuracy: 0.8480


## 4. Model Evaluation

A model trained with a customized loop is not compiled, so `model.evaluate()` cannot be applied directly.

In [117]:
def evaluate_model(model, ds_valid):
    # Accumulate loss and accuracy over every batch of the validation set.
    for features, labels in ds_valid:
        valid_step(model, features, labels)
    print(f'Valid Loss: {valid_loss.result():.4f}, Valid Accuracy: {valid_metric.result():.4f}')

    # Reset the validation metrics so a later evaluation starts from zero.
    valid_loss.reset_states()
    valid_metric.reset_states()

In [118]:
evaluate_model(model, ds_test)

Valid Loss: 1.3089, Valid Accuracy: 0.8480


## 5. Model Application

Several methods are available:
- `model.predict()`
- `model()`
- `model.call()`
- `model.predict_on_batch()`

`model.predict()` is recommended, since it can be applied to both a Dataset and a Tensor.

In [120]:
model.predict(ds_test)



array([[0.99999976],
       [0.9998863 ],
       [0.99853206],
       ...,
       [0.9999656 ],
       [0.68822837],
       [1.        ]], dtype=float32)

In [121]:
for x_test,_ in ds_test.take(1):
    print(model(x_test))
    # Equivalent expressions:
    #print(model.call(x_test))
    #print(model.predict_on_batch(x_test))

tf.Tensor(
[[9.9999976e-01]
 [9.9988627e-01]
 [9.9853206e-01]
 [1.6907185e-16]
 [5.0385017e-04]
 [5.3917755e-09]
 [1.0308210e-07]
 [7.3073439e-05]
 [9.9999332e-01]
 [7.2912651e-01]
 [9.8259407e-01]
 [9.9999899e-01]
 [3.6092793e-08]
 [9.9998492e-01]
 [1.1585158e-07]
 [8.3854403e-03]
 [2.8352773e-10]
 [8.2858754e-03]
 [5.1688805e-04]
 [8.9920485e-01]
 [1.8289582e-11]
 [1.0000000e+00]
 [9.9985754e-01]
 [8.1316625e-09]
 [1.0000000e+00]
 [9.9999434e-01]
 [9.8755842e-01]
 [4.6968147e-01]
 [9.9993324e-01]
 [9.9864715e-01]
 [1.6044406e-04]
 [9.9997693e-01]], shape=(32, 1), dtype=float32)
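
The model outputs sigmoid probabilities rather than class labels. One common way to turn them into 0/1 sentiment labels is to threshold them, sketched below on four values copied from the batch output above. The 0.5 cut-off is an assumption that matches the usual binary-accuracy convention.

```python
# Four probabilities copied from the batch output above.
probs = [9.9999976e-01, 1.6907185e-16, 7.2912651e-01, 4.6968147e-01]

THRESHOLD = 0.5  # assumed decision threshold
predicted_labels = [int(p > THRESHOLD) for p in probs]
print(predicted_labels)
```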


## 6. Model Saving

Saving the model in TensorFlow's native SavedModel format is the recommended approach.

In [123]:
model.save('/content/model/tf_model_savedmodel', save_format='tf')
print('export saved model')

export saved model


In [124]:
model_loaded = tf.keras.models.load_model('/content/model/tf_model_savedmodel')
model_loaded.summary()



Model: "cnn_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  70000     
                                                                 
 conv_1 (Conv1D)             multiple                  576       
                                                                 
 pool_1 (MaxPooling1D)       multiple                  0         
                                                                 
 conv_2 (Conv1D)             multiple                  4224      
                                                                 
 pool_2 (MaxPooling1D)       multiple                  0         
                                                                 
 flatten (Flatten)           multiple                  0         
                                                                 
 dense (Dense)               multiple                  61

In [126]:
model_loaded.predict(ds_test)



array([[0.99999976],
       [0.9998863 ],
       [0.99853206],
       ...,
       [0.9999656 ],
       [0.68822837],
       [1.        ]], dtype=float32)