
In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [2]:
%cd "/gdrive/MyDrive/Colab Notebooks/git/eat_tensorflow2_in_30_days/notebooks"

/gdrive/MyDrive/Colab Notebooks/git/eat_tensorflow2_in_30_days/notebooks


In [3]:
!cat "../data/imdb/test.csv" | head -5

1	The first one meant victory. This one means defeat. It takes place in a Bolivia, there the guerillas are sick and wary and don't meet that much sympathy from the farmers. If you know your 60s history, you understand how it ends. You will understand it even without that knowledge.<br /><br />Del Toro is once again splendid. He goes on building this icon about the revolutionary who remains the same, regardless of success or failure. That's what Guevara is according to the legend, but still it's so well acted.<br /><br />The documentary feeling is there around the icon, which is one of the greatest achievements in this big Soderbergh project. He has succeeded.
1	Excellent movie, a realistic picture of contemporary Finland, touching and profound. One of the best Finnish films ever made. Captures marvelously the everyday life in a Central Finland small town, people's desires and weaknesses, joys and sorrows. The bright early fall sunshine creates a cool atmosphere to this lucid examinati

**1. Data Preparation**

The task on the IMDB dataset is to predict the sentiment label of a movie review from its text.

The training dataset contains 20000 reviews and the testing dataset contains 5000; each is half positive and half negative.

Pre-processing a text dataset is somewhat involved: it includes word segmentation (needed for Chinese only, so not relevant to this demonstration), vocabulary construction, encoding, sequence padding, and data pipeline construction.
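To make the vocabulary, encoding, and padding steps concrete, here is a minimal plain-Python sketch. The helper names (`build_vocab`, `encode`) and the toy sentences are hypothetical and only illustrate the idea; this is not how TensorFlow implements these steps internally.

```python
from collections import Counter

PAD, UNK = "", "[UNK]"  # index 0 pads, index 1 marks out-of-vocabulary words


def build_vocab(texts, max_words):
    """Keep the most frequent words, reserving slots 0 and 1."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_words - 2)]
    return [PAD, UNK] + most_common


def encode(text, vocab, max_len):
    """Map words to indices, then truncate or zero-pad to max_len."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical toy corpus, not taken from the IMDB data.
toy_texts = ["a good film", "a good story", "a good film", "a bad one"]
toy_vocab = build_vocab(toy_texts, max_words=5)
print(toy_vocab)
print(encode("a good movie", toy_vocab, max_len=6))  # "movie" is out of vocabulary
```

The unknown word maps to index 1 and the tail is filled with index 0, which is the same convention visible in the vocabulary printed later in this notebook (`''` first, `'[UNK]'` second).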

There are two popular methods of text preparation in TensorFlow.

The first one is constructing the text data generator using Tokenizer in tf.keras.preprocessing, together with tf.keras.utils.Sequence.

The second one is using tf.data.Dataset, together with the pre-processing layer tf.keras.layers.experimental.preprocessing.TextVectorization.

The former is more complex and is not shown in this notebook.

The latter is TensorFlow's native method and is simpler.

The second method is introduced below.

In [4]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
import re, string

In [5]:
MAX_WORDS = 10000
MAX_LEN = 200
BATCH_SIZE = 20

train_data_path = "../data/imdb/train.csv"
test_data_path = "../data/imdb/test.csv"

In [6]:
def split_line(line):
    arr = tf.strings.split(line, "\t")
    label = tf.expand_dims(tf.cast(tf.strings.to_number(arr[0]), tf.int32), axis=0)
    text = tf.expand_dims(arr[1], axis=0)
    return (text, label)

def clean_text(text):
    lowercase = tf.strings.lower(text)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    clean_punct = tf.strings.regex_replace(stripped_html, "[%s]" % re.escape(string.punctuation), "")
    return clean_punct
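To see what `clean_text` does to a single review, the following plain-Python mirror can help. The name `clean_text_py` and the sample string are hypothetical; the function only approximates the TensorFlow version by using Python's `re` module instead of `tf.strings`.

```python
import re
import string


def clean_text_py(text):
    """Plain-Python mirror of clean_text for inspecting single strings."""
    lowercase = text.lower()
    # Replace the HTML line break first: "<", "/" and ">" are punctuation,
    # so stripping punctuation first would leave a stray "br" token behind.
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)


print(clean_text_py("Del Toro is splendid.<br /><br />Don't miss it!"))
```

Apostrophes are removed along with other punctuation, which is why tokens such as `dont` show up in the learned vocabulary.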

In [7]:
ds_train_raw =  tf.data.TextLineDataset(filenames = [train_data_path]) \
   .map(split_line,num_parallel_calls = tf.data.experimental.AUTOTUNE) \
   .shuffle(buffer_size = 1000).batch(BATCH_SIZE) \
   .prefetch(tf.data.experimental.AUTOTUNE)

ds_test_raw = tf.data.TextLineDataset(filenames = [test_data_path]) \
   .map(split_line,num_parallel_calls = tf.data.experimental.AUTOTUNE) \
   .batch(BATCH_SIZE) \
   .prefetch(tf.data.experimental.AUTOTUNE)


In [8]:
vectorize_layer = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=MAX_WORDS-1, # leave one item for the placeholder
    standardize=clean_text,
    split="whitespace", 
    output_mode='int',
    output_sequence_length=MAX_LEN 
)

# vectorize "fit" on train text
ds_text = ds_train_raw.map(lambda text, label: text)
vectorize_layer.adapt(ds_text)
print(vectorize_layer.get_vocabulary()[:100])

['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i', 'this', 'that', 'was', 'as', 'for', 'with', 'movie', 'but', 'film', 'on', 'not', 'you', 'his', 'are', 'have', 'be', 'he', 'one', 'its', 'at', 'all', 'by', 'an', 'they', 'from', 'who', 'so', 'like', 'her', 'just', 'or', 'about', 'has', 'if', 'out', 'some', 'there', 'what', 'good', 'more', 'when', 'very', 'she', 'even', 'my', 'no', 'would', 'up', 'time', 'only', 'which', 'story', 'really', 'their', 'were', 'had', 'see', 'can', 'me', 'than', 'we', 'much', 'well', 'get', 'been', 'will', 'into', 'people', 'also', 'other', 'do', 'bad', 'because', 'great', 'first', 'how', 'him', 'most', 'dont', 'made', 'then', 'them', 'films', 'movies', 'way', 'make', 'could', 'too', 'any']


In [9]:
sentences = ["this is sentence 1",
             "this is sentence 2"]

print(vectorize_layer(sentences))

tf.Tensor(
[[  11    7 4309  468    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0  

In [10]:
#Word encoding
ds_train = ds_train_raw.map(lambda text,label:(vectorize_layer(text),label)) \
    .prefetch(tf.data.experimental.AUTOTUNE)
ds_test = ds_test_raw.map(lambda text,label:(vectorize_layer(text),label)) \
    .prefetch(tf.data.experimental.AUTOTUNE)

**2. Model Definition**

Keras generally offers three ways to build a model: the Sequential() function for a linear stack of layers, the functional API for arbitrary topologies, and a customized model that inherits from the base class Model.

In this example, we build a customized model by inheriting from the base class Model.

In [11]:
class CnnModel(tf.keras.Model):
    def __init__(self):
        super(CnnModel, self).__init__()
        
    def build(self,input_shape):
        self.embedding = tf.keras.layers.Embedding(MAX_WORDS,7,input_length=MAX_LEN)
        self.conv_1 = tf.keras.layers.Conv1D(16, kernel_size= 5,name = "conv_1",activation = "relu")
        self.pool_1 = tf.keras.layers.MaxPool1D(name = "pool_1")
        self.conv_2 = tf.keras.layers.Conv1D(128, kernel_size=2,name = "conv_2",activation = "relu")
        self.pool_2 = tf.keras.layers.MaxPool1D(name = "pool_2")
        self.flatten = tf.keras.layers.Flatten()
        self.dense = tf.keras.layers.Dense(1,activation = "sigmoid")
        super(CnnModel,self).build(input_shape)
    
    def call(self, x):
        x = self.embedding(x)
        x = self.conv_1(x)
        x = self.pool_1(x)
        x = self.conv_2(x)
        x = self.pool_2(x)
        x = self.flatten(x)
        x = self.dense(x)
        return(x)
    
    # To show Output Shape
    def summary(self):
        x_input = tf.keras.layers.Input(shape = (MAX_LEN,))
        output = self.call(x_input)
        model = tf.keras.Model(inputs = x_input,outputs = output)
        model.summary()

In [12]:
model = CnnModel()
model.build(input_shape =(None,MAX_LEN))
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 200)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 200, 7)            70000     
_________________________________________________________________
conv_1 (Conv1D)              (None, 196, 16)           576       
_________________________________________________________________
pool_1 (MaxPooling1D)        (None, 98, 16)            0         
_________________________________________________________________
conv_2 (Conv1D)              (None, 97, 128)           4224      
_________________________________________________________________
pool_2 (MaxPooling1D)        (None, 48, 128)           0         
_________________________________________________________________
flatten (Flatten)            (None, 6144)              0     
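The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used by the model above (Conv1D with "valid" padding and stride 1, MaxPool1D with pool size 2); the helper names are hypothetical.

```python
MAX_WORDS, MAX_LEN, EMBED_DIM = 10000, 200, 7


def conv1d(length, in_channels, filters, kernel_size):
    """Return (output length, channels, parameters) for a 'valid', stride-1 Conv1D."""
    out_length = length - kernel_size + 1
    params = (kernel_size * in_channels + 1) * filters  # weights plus one bias per filter
    return out_length, filters, params


def max_pool1d(length, pool_size=2):
    """Output length of a MaxPool1D whose stride equals its pool size."""
    return length // pool_size


embedding_params = MAX_WORDS * EMBED_DIM
length, channels, conv_1_params = conv1d(MAX_LEN, EMBED_DIM, filters=16, kernel_size=5)
length = max_pool1d(length)
length, channels, conv_2_params = conv1d(length, channels, filters=128, kernel_size=2)
length = max_pool1d(length)
flatten_units = length * channels

print(embedding_params, conv_1_params, conv_2_params, flatten_units)
# 70000 576 4224 6144
```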

**3. Model Training**

There are three common ways to train a model: the built-in fit method, the built-in train_on_batch method, and a customized training loop. Here we use a customized training loop.

In [22]:
#Time Stamp
@tf.function
def printbar():
    ts = tf.timestamp()
    today_ts = ts%(24*60*60)  # seconds elapsed since midnight UTC

    hour = tf.cast(today_ts//3600+8, tf.int32)%tf.constant(24)  # +8 shifts UTC to UTC+8
    minute = tf.cast((today_ts%3600)//60, tf.int32)
    second = tf.cast(tf.floor(today_ts%60), tf.int32)

    def timeformat(m):
        if tf.strings.length(tf.strings.format("{}", m)) == 1:
            return (tf.strings.format("0{}", m))
        else:
            return (tf.strings.format("{}", m))

    timestring = tf.strings.join([timeformat(hour),timeformat(minute),
                timeformat(second)],separator = ":")
    tf.print("=========="*8+timestring)
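For comparison, the same clock arithmetic can be written in plain Python. This is only a sketch: `format_clock` and its `utc_offset_hours` parameter are hypothetical names, and the default of 8 simply mirrors the `+8` used in `printbar`.

```python
def format_clock(timestamp, utc_offset_hours=8):
    """Format seconds since the Unix epoch as HH:MM:SS in a fixed-offset zone."""
    seconds_today = int(timestamp) % (24 * 60 * 60)  # seconds since midnight UTC
    hour = (seconds_today // 3600 + utc_offset_hours) % 24
    minute = (seconds_today % 3600) // 60
    second = seconds_today % 60
    return "{:02d}:{:02d}:{:02d}".format(hour, minute, second)


print(format_clock(0))                           # 08:00:00
print(format_clock(59_999, utc_offset_hours=0))  # 16:39:59
```

The zero-padding done by `timeformat` in the TensorFlow version corresponds to the `:02d` format specifier here.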


In [25]:
optimizer = tf.keras.optimizers.Nadam()
loss_func = tf.keras.losses.BinaryCrossentropy()

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_metric = tf.keras.metrics.BinaryAccuracy(name='train_accuracy')

valid_loss = tf.keras.metrics.Mean(name='valid_loss')
valid_metric = tf.keras.metrics.BinaryAccuracy(name='valid_accuracy')

@tf.function
def train_step(model, features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = loss_func(labels, predictions)
    
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss.update_state(loss)
    train_metric.update_state(labels, predictions)

@tf.function
def valid_step(model, features, labels):
    predictions = model(features, training=False)
    batch_loss = loss_func(labels, predictions)
    valid_loss.update_state(batch_loss)
    valid_metric.update_state(labels, predictions)

def train_model(model, ds_train, ds_valid, epochs):
    for epoch in tf.range(1, epochs+1):

        for features, labels in ds_train:
            train_step(model, features, labels)

        for features, labels in ds_valid:
            valid_step(model, features, labels)
        
        logs = "Epoch={}, Loss:{}, Accuracy:{}, Valid_loss: {}, Valid_accuracy:{}"

        if epoch%1==0:
            printbar()
            tf.print(tf.strings.format(logs,
                                       (epoch, train_loss.result(), train_metric.result(), valid_loss.result(), valid_metric.result())))
            tf.print("")


        train_loss.reset_states()
        valid_loss.reset_states()
        train_metric.reset_states()
        valid_metric.reset_states()
        

In [26]:
train_model(model, ds_train, ds_test, epochs=6)

Epoch=1, Loss:0.137153059, Accuracy:0.94915, Valid_loss: 0.402908385, Valid_accuracy:0.8674

Epoch=2, Loss:0.0935169086, Accuracy:0.9676, Valid_loss: 0.541632116, Valid_accuracy:0.863

Epoch=3, Loss:0.0567635931, Accuracy:0.9803, Valid_loss: 0.758906305, Valid_accuracy:0.8482

Epoch=4, Loss:0.0327552296, Accuracy:0.98975, Valid_loss: 0.860463619, Valid_accuracy:0.8538

Epoch=5, Loss:0.0190788414, Accuracy:0.99385, Valid_loss: 1.03909099, Valid_accuracy:0.8528

Epoch=6, Loss:0.0168200135, Accuracy:0.99355, Valid_loss: 1.14302337, Valid_accuracy:0.8542
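In the log above, the training loss keeps falling while the validation loss rises after the first epoch, which suggests the model starts to overfit early. As a rough sketch, the epoch with the lowest validation loss can be picked out from the logged values; the `history` list below is copied by hand from that log and is not produced by the training loop itself.

```python
# (epoch, validation loss, validation accuracy) copied from the training log.
history = [
    (1, 0.402908385, 0.8674),
    (2, 0.541632116, 0.8630),
    (3, 0.758906305, 0.8482),
    (4, 0.860463619, 0.8538),
    (5, 1.03909099, 0.8528),
    (6, 1.14302337, 0.8542),
]

# Select the row whose validation loss (second field) is smallest.
best_epoch, best_loss, best_accuracy = min(history, key=lambda row: row[1])
print(best_epoch, best_loss, best_accuracy)
```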



**4. Model Evaluation**

A model trained with a customized loop is not compiled, so the method `model.evaluate(ds_valid)` can't be applied directly.
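Evaluating by hand amounts to averaging the per-batch losses and counting correct predictions, which is what `valid_step` accumulates through the metric objects. The plain-Python sketch below illustrates that bookkeeping on toy numbers; the function names, the clipping value `eps`, and the 0.5 threshold are assumptions chosen for illustration rather than values read from the Keras source.

```python
import math


def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy over one batch of predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0 (eps is assumed)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def evaluate_batches(batches, threshold=0.5):
    """Average per-batch losses and compute accuracy over all examples."""
    batch_losses, correct, seen = [], 0, 0
    for labels, probs in batches:
        batch_losses.append(binary_crossentropy(labels, probs))
        correct += sum(int(p > threshold) == y for y, p in zip(labels, probs))
        seen += len(labels)
    return sum(batch_losses) / len(batch_losses), correct / seen


# Hypothetical labels and predicted probabilities, two batches of two examples.
toy_batches = [([1, 0], [0.9, 0.2]), ([1, 0], [0.4, 0.1])]
loss, accuracy = evaluate_batches(toy_batches)
print(round(loss, 4), accuracy)
```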

In [29]:
def evaluate_model(model,ds_valid):
    for features, labels in ds_valid:
        valid_step(model,features,labels)
    logs = 'Valid Loss:{},Valid Accuracy:{}' 
    tf.print(tf.strings.format(logs,(valid_loss.result(),valid_metric.result())))
    
    # Only the validation metrics were updated here, so only they are reset.
    valid_loss.reset_states()
    valid_metric.reset_states()

In [30]:
evaluate_model(model, ds_test)

Valid Loss:1.14302325,Valid Accuracy:0.8542


**5. Model Saving**

Saving the model in TensorFlow's native format is recommended.

In [31]:
%cd "/gdrive/MyDrive/Colab Notebooks/git/eat_tensorflow2_in_30_days/notebooks"

/gdrive/MyDrive/Colab Notebooks/git/eat_tensorflow2_in_30_days/notebooks


In [32]:
model.save("../model_weights/imdb_model", save_format="tf")
print('export saved model')

INFO:tensorflow:Assets written to: ../model_weights/imdb_model/assets
export saved model


In [34]:
model_loaded = tf.keras.models.load_model("../model_weights/imdb_model")
evaluate_model(model_loaded, ds_test)

Valid Loss:1.01482117,Valid Accuracy:0.8542


In [35]:
%cd "../model_weights/"
!rm -r imdb_model

/gdrive/MyDrive/Colab Notebooks/git/eat_tensorflow2_in_30_days/model_weights
