# Training a Comment-spam Detection model with TensorFlow Lite Model Maker

## Learning objectives

1. Install TensorFlow Lite Model Maker.
2. Download the data from the Colab server to your device.
3. Use a data loader to make the training data.
4. Build the model.

## Overview

In this lab, you review code that uses TensorFlow and TensorFlow Lite Model Maker to train a model on a comment-spam dataset. The original data is available on Kaggle. It has been gathered into a single CSV file and cleaned up by removing broken text, markup, repeated words, and more, so that you can focus on the model instead of the text.

Each learning objective corresponds to a __#TODO__ in the [student lab notebook](../labs/spam_comments_model_maker.ipynb). Try to complete that notebook before reviewing this solution notebook.


### Install TensorFlow Lite Model Maker

In [1]:
# Install Model Maker
!pip install -q tflite-model-maker &> /dev/null

**Note:** After the installation, restart the kernel by clicking **Kernel > Restart kernel > Restart**.

### Import the code

Import the necessary dependencies and check the TensorFlow version:

In [1]:
# Imports and check that we are using TF2.x
import numpy as np
import os

from tflite_model_maker import configs
from tflite_model_maker import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker.text_classifier import DataLoader

import tensorflow as tf
assert tf.__version__.startswith('2')
tf.get_logger().setLevel('ERROR')

### Download the dataset

Download the dataset as a CSV file and store its local path in `data_file`:

In [2]:
# Download the dataset as a CSV and store as data_file
data_file = tf.keras.utils.get_file(fname='comment-spam.csv', origin='https://storage.googleapis.com/laurencemoroney-blog.appspot.com/lmblog_comments.csv', extract=False)

Downloading data from https://storage.googleapis.com/laurencemoroney-blog.appspot.com/lmblog_comments.csv
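
Before loading the file into Model Maker, it can help to confirm that the CSV has the columns the later cells expect. The sketch below uses only the standard library. `peek_csv` is a hypothetical helper, and the sample rows are invented for illustration (they are not taken from the dataset). The column names `commenttext` and `spam` are the ones this notebook passes to `DataLoader.from_csv`.

```python
import csv
import os
import tempfile


def peek_csv(path, n=3):
    """Return the header and the first n rows of a CSV file as dicts."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        rows = []
        for row in reader:
            if len(rows) >= n:
                break
            rows.append(row)
        return reader.fieldnames, rows


# Illustrative stand-in for data_file; these rows are invented, not real data.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(sample_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['commenttext', 'spam'])
    writer.writerow(['great post thanks for sharing', 'false'])
    writer.writerow(['buy cheap followers now', 'true'])

header, rows = peek_csv(sample_path)
print(header)
print(rows[0]['commenttext'])
```

In the notebook itself, you would call `peek_csv(data_file)` instead of writing a sample file.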


### Use a model spec from Model Maker

In [3]:
# Use a model spec from model maker. Options are 'mobilebert_classifier', 'bert_classifier' and 'average_word_vec'
# The first 2 use the BERT model, which is accurate, but larger and slower to train
# Average Word Vec is kinda like transfer learning where there are pre-trained word weights
# and dictionaries
spec = model_spec.get('average_word_vec')
spec.num_words = 2000
spec.seq_len = 20
spec.wordvec_dim = 7
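
The three settings above control how text becomes model input: `num_words` caps the vocabulary size, `seq_len` fixes the number of tokens per comment, and `wordvec_dim` sets the length of each word's embedding vector. The following is a simplified illustration of the first two, not Model Maker's actual preprocessing. The helper names, the whitespace tokenizer, and the reserved `<PAD>`/`<UNKNOWN>` ids are assumptions made for the sketch.

```python
from collections import Counter

PAD, UNKNOWN = 0, 1  # reserved ids (an assumption for this sketch)


def build_vocab(texts, num_words):
    """Keep the most frequent words, up to num_words entries including reserved ids."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {'<PAD>': PAD, '<UNKNOWN>': UNKNOWN}
    for word, _ in counts.most_common(max(num_words - len(vocab), 0)):
        vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, seq_len):
    """Map words to ids, then truncate or pad to exactly seq_len ids."""
    ids = [vocab.get(word, UNKNOWN) for word in text.lower().split()]
    ids = ids[:seq_len]
    return ids + [PAD] * (seq_len - len(ids))


texts = ['win free money now', 'thanks for the free tutorial']
vocab = build_vocab(texts, num_words=6)
encoded = encode('free money for everyone', vocab, seq_len=6)
print(encoded)
```

Words outside the capped vocabulary collapse to the unknown id, and short comments are padded, so every comment reaches the model as a list of exactly `seq_len` integers.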

In [4]:
# Load the CSV using DataLoader.from_csv to make the training_data
data = DataLoader.from_csv(
      filename=data_file,
      text_column='commenttext',
      label_column='spam',
      model_spec=spec,
      delimiter=',',
      shuffle=True,
      is_training=True)

train_data, test_data = data.split(0.9)
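
`data.split(0.9)` places 90% of the examples in `train_data` and the remaining 10% in `test_data`. As a rough mental model only (this is not Model Maker's implementation), a fractional split over an already shuffled list can be sketched as follows. `split_examples` is a hypothetical helper.

```python
def split_examples(examples, fraction):
    """Split a sequence into two parts; the first holds `fraction` of the items."""
    if not 0.0 < fraction < 1.0:
        raise ValueError('fraction must be between 0 and 1 (exclusive)')
    cut = int(len(examples) * fraction)
    return examples[:cut], examples[cut:]


examples = list(range(1000))  # stand-in for 1,000 labeled comments
train, test = split_examples(examples, 0.9)
print(len(train), len(test))
```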

### Build the model

In [5]:
# Build the model
model = text_classifier.create(train_data, model_spec=spec, epochs=50, validation_data=test_data)
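
The name `average_word_vec` hints at the core idea behind this model spec: look up a vector for each token in a comment, average those vectors, and classify the result. The sketch below illustrates that idea in plain Python. It is a conceptual toy rather than the library's implementation; the function names, the single linear layer, and all numeric values are assumptions, and in real training the embeddings and weights are learned from the data.

```python
import math


def average_vectors(token_ids, embeddings):
    """Average the embedding vectors of the given token ids, dimension by dimension."""
    vectors = [embeddings[i] for i in token_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sigmoid_score(features, weights, bias):
    """Turn the averaged vector into a score between 0 and 1 with a linear layer."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy 2-dimensional embeddings for a 4-word vocabulary (values are made up).
embeddings = [[0.0, 0.0], [1.0, -1.0], [3.0, 1.0], [-1.0, 2.0]]
averaged = average_vectors([1, 2], embeddings)  # mean of rows 1 and 2
score = sigmoid_score(averaged, weights=[0.5, -0.5], bias=0.0)
print(averaged, round(score, 3))
```

Because the vectors are averaged, word order is ignored, which keeps the model small and fast to train compared with the BERT-based specs.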


In [6]:
# Evaluate on the held-out split rather than on the data the model was trained on
loss, accuracy = model.evaluate(test_data)



### Export a model

Export a model to SavedModel format with the model, vocabulary and labels.

In [7]:
# This will export to SavedModel format with the model, vocabulary and labels.
model.export(export_dir='/mm_spam_savedmodel/', export_format=[ExportFormat.LABEL, ExportFormat.VOCAB, ExportFormat.SAVED_MODEL])

In [8]:
# Rename the SavedModel subfolder to a version number
!mv /mm_spam_savedmodel/saved_model /mm_spam_savedmodel/123
!zip -r mm_spam_savedmodel.zip /mm_spam_savedmodel/

  adding: mm_spam_savedmodel/ (stored 0%)
  adding: mm_spam_savedmodel/vocab.txt (deflated 47%)
  adding: mm_spam_savedmodel/labels.txt (stored 0%)
  adding: mm_spam_savedmodel/123/ (stored 0%)
  adding: mm_spam_savedmodel/123/saved_model.pb (deflated 87%)
  adding: mm_spam_savedmodel/123/variables/ (stored 0%)
  adding: mm_spam_savedmodel/123/variables/variables.data-00000-of-00001 (deflated 35%)
  adding: mm_spam_savedmodel/123/variables/variables.index (deflated 59%)
  adding: mm_spam_savedmodel/123/assets/ (stored 0%)
  adding: mm_spam_savedmodel/123/keras_metadata.pb (deflated 86%)
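
The shell commands above rename the exported `saved_model` folder to a numeric version (`123`) and zip the result. TensorFlow Serving, for example, looks for numbered version subdirectories under a model's base path. For environments without `mv` and `zip`, the same steps can be sketched with the standard library. `version_and_zip` is a hypothetical helper, and the placeholder directory below only stands in for a real export.

```python
import os
import shutil
import tempfile
import zipfile


def version_and_zip(export_dir, version):
    """Rename export_dir/saved_model to export_dir/<version>, then zip export_dir."""
    src = os.path.join(export_dir, 'saved_model')
    dst = os.path.join(export_dir, str(version))
    os.rename(src, dst)
    export_dir = os.path.normpath(export_dir)
    # make_archive appends '.zip' and returns the path of the archive it wrote.
    return shutil.make_archive(
        base_name=export_dir,
        format='zip',
        root_dir=os.path.dirname(export_dir),
        base_dir=os.path.basename(export_dir))


# Placeholder export tree so the sketch runs without a trained model.
work = tempfile.mkdtemp()
export_dir = os.path.join(work, 'mm_spam_savedmodel')
os.makedirs(os.path.join(export_dir, 'saved_model'))
with open(os.path.join(export_dir, 'saved_model', 'saved_model.pb'), 'wb') as f:
    f.write(b'placeholder')

archive = version_and_zip(export_dir, 123)
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
print(archive)
print(names)
```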


In [9]:
# Optional extra
# Use this cell to export the learned embeddings for projector.tensorflow.org,
# where you can explore them interactively.
import io

embedding_layer = model.model.layers[0]
weights = embedding_layer.get_weights()[0]
vocab = model.model_spec.vocab  # maps each word to its integer index

# vecs.tsv holds one embedding vector per line; meta.tsv holds the matching word.
# The with-statement closes both files even if a write fails.
with io.open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for word in vocab:
    index = vocab[word]
    vector = weights[index]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in vector]) + "\n")


try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')
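
The Embedding Projector matches vectors to metadata row by row, so the two files need the same number of lines. A quick sanity check before uploading could look like the following sketch. `check_projector_files` is a hypothetical helper, and the two tiny files written here are invented stand-ins for `vecs.tsv` and `meta.tsv`.

```python
import os
import tempfile


def check_projector_files(vecs_path, meta_path):
    """Return (rows, dim) if both files have the same row count and a constant dim."""
    with open(vecs_path, encoding='utf-8') as f:
        vectors = [line.rstrip('\n').split('\t') for line in f if line.strip()]
    with open(meta_path, encoding='utf-8') as f:
        words = [line.rstrip('\n') for line in f if line.strip()]
    if len(vectors) != len(words):
        raise ValueError(f'{len(vectors)} vectors but {len(words)} words')
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError(f'inconsistent vector lengths: {sorted(dims)}')
    return len(vectors), dims.pop()


# Invented stand-in files so the sketch runs without a trained model.
tmp = tempfile.mkdtemp()
vecs_path = os.path.join(tmp, 'vecs.tsv')
meta_path = os.path.join(tmp, 'meta.tsv')
with open(vecs_path, 'w', encoding='utf-8') as f:
    f.write('0.1\t0.2\t0.3\n0.4\t0.5\t0.6\n')
with open(meta_path, 'w', encoding='utf-8') as f:
    f.write('free\nmoney\n')

rows, dim = check_projector_files(vecs_path, meta_path)
print(rows, dim)
```

With the files produced by the cell above, `dim` should equal the `wordvec_dim` set on the model spec.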
