## Build a Spam Classifier with Keras
Deep learning has made spam detection steadily easier. Over time, and with the aid of direct user feedback, our spam classifier should produce fewer and fewer erroneous results.

This is the first part of a multi-part series covering how to:

- Build an AI Model (this one)
- Integrate a NoSQL Database (inference result storing)
- Deploy an AI Model into Production

### Prerequisites
- Prepare your dataset using this notebook.
- Convert your dataset into trainable vectors in this notebook (either way, the current notebook runs that step for us).

### Running this notebook:
- Recommended: Use Colab, as it offers free GPUs for training models. Launch this notebook here.

In [16]:
import boto3
import os
import json
import pathlib
import pickle
import pandas as pd

from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.models import Model, Sequential

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [17]:
USE_PROJECT_ROOT = True
BASE_DIR = pathlib.Path().resolve()
if USE_PROJECT_ROOT:
    BASE_DIR = BASE_DIR.parent

DATASET_DIR = BASE_DIR / "datasets"
EXPORT_DIR = DATASET_DIR / "exports"
DATASET_CSV_PATH = EXPORT_DIR / 'spam-dataset.csv'

GUIDES_DIR = BASE_DIR / "guides"
TRAINING_DATA_PATH = EXPORT_DIR / 'spam-training-data.pkl'
PART_TWO_GUIDE_PATH = GUIDES_DIR / "02-Convert_Dataset_Into_Vectors.ipynb"

## Prepare Dataset
A dataset is rarely created in the same place where training runs. The cells below download the data we need to train against.

```shell
!mkdir -p "$EXPORT_DIR"
!mkdir -p "$GUIDES_DIR"
# -L follows redirects and ?raw=true requests the raw file rather than GitHub's HTML page
!curl -L "https://github.com/KewJS/spam_classification/blob/master/data_local/exports/spam-dataset.csv?raw=true" -o "$DATASET_CSV_PATH"
!curl -L "https://github.com/KewJS/spam_classification/blob/master/nbs/02-Convert_Dataset_Into_Vectors.ipynb?raw=true" -o "$PART_TWO_GUIDE_PATH"
```

In [18]:
df = pd.read_csv(DATASET_CSV_PATH)
df.head()

Unnamed: 0,label,text,source
0,ham,"Go until jurong point, crazy.. Available only ...",sms-spam
1,ham,Ok lar... Joking wif u oni...,sms-spam
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,sms-spam
3,ham,U dun say so early hor... U c already then say...,sms-spam
4,ham,"Nah I don't think he goes to usf, he lives aro...",sms-spam


In [19]:
%run "$PART_TWO_GUIDE_PATH"

BASE_DIR is F:\KEW_JING_SHENG\01-SELF_LEARNING\02-Data_Science\35-Spam_Classification
Random Index 6582
Found 9730 unique tokens.
Done creating tokenized train & test data...


In [20]:
data = {}

with open(TRAINING_DATA_PATH, 'rb') as f:
    data = pickle.load(f)

> The code above uses <code>pickle</code> to load data, but that data was itself exported via <code>pickle</code> when we executed <code>%run</code> a few steps ago. Loading <code>pickle</code> data downloaded from a third party can be unsafe. Here the pickle file is generated locally (again, by <code>%run</code>) and never downloaded, so it is safe to use.

## Transform Extracted Dataset

In [21]:
X_test = data['X_test']
X_train = data['X_train']
y_test = data['y_test']
y_train = data['y_train']
labels_legend_inverted = data['labels_legend_inverted']
legend = data['legend']
max_sequence = data['max_sequence']
max_words = data['max_words']
tokenizer = data['tokenizer']

## Create our LSTM Model

In [22]:
embed_dim = 128
lstm_out = 196

model = Sequential()
# The vocabulary size comes from the max_words value loaded from the training data above
model.add(Embedding(max_words, embed_dim, input_length=X_train.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.3, recurrent_dropout=0.3))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 300, 128)          35840     
                                                                 
 spatial_dropout1d_1 (Spatia  (None, 300, 128)         0         
 lDropout1D)                                                     
                                                                 
 lstm_1 (LSTM)               (None, 196)               254800    
                                                                 
 dense_1 (Dense)             (None, 2)                 394       
                                                                 
Total params: 291,034
Trainable params: 291,034
Non-trainable params: 0
_________________________________________________________________
None
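The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The vocabulary size of 280 is an assumption inferred from the summary itself (35,840 / 128), and the helper names are made up for this illustration.

```python
def embedding_params(vocab_size: int, embed_dim: int) -> int:
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim


def lstm_params(input_dim: int, units: int) -> int:
    # Four gates, each with an input kernel, a recurrent kernel, and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim: int, units: int) -> int:
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


vocab_size = 280  # assumption: inferred from 35,840 / 128 in the summary
embed_dim = 128
lstm_out = 196

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "lstm": lstm_params(embed_dim, lstm_out),
    "dense": dense_params(lstm_out, 2),
}
total = sum(counts.values())
print(counts, total)
```

The dropout layers add no parameters, which is why <code>SpatialDropout1D</code> shows 0 in the summary.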


In [23]:
batch_size = 32
epochs = 5
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=batch_size, verbose=1, epochs=epochs)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x166c50f0370>

In [24]:
model

<keras.engine.sequential.Sequential at 0x166e9cb0eb0>

In [25]:
MODEL_EXPORT_PATH = EXPORT_DIR / 'spam-model.h5'
model.save(str(MODEL_EXPORT_PATH))

## Predict New Data

In [26]:
import numpy as np

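def predict(text_str, max_words=280, max_sequence=280, tokenizer=None):
    # max_words is accepted for symmetry with the metadata but is not used here
    if not tokenizer:
        return None
    # Convert the raw text into a padded integer sequence
    sequences = tokenizer.texts_to_sequences([text_str])
    x_input = pad_sequences(sequences, maxlen=max_sequence)
    # model.predict returns one row of class probabilities per input text
    y_output = model.predict(x_input)
    # We passed a single text, so take the first (and only) row
    preds = y_output[0]
    # Pair each probability with its label name
    labeled_preds = [{f"{labels_legend_inverted[str(i)]}": x} for i, x in enumerate(preds)]
    return labeled_preds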

In [27]:
predict("Buy me a new phone with discount", max_words=max_words, max_sequence=max_sequence, tokenizer=tokenizer)

[{'ham': 0.90141803}, {'spam': 0.09858193}]
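The labeling step at the end of <code>predict</code> can be looked at in isolation, without a model. The sketch below assumes a legend that maps string indices to label names, matching the output above. <code>label_predictions</code> and <code>top_label</code> are hypothetical helper names, not part of the notebook.

```python
def label_predictions(probabilities, labels_legend_inverted):
    """Pair each class probability with its label name."""
    return [
        {labels_legend_inverted[str(i)]: float(p)}
        for i, p in enumerate(probabilities)
    ]


def top_label(probabilities, labels_legend_inverted):
    """Return the label whose probability is highest."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels_legend_inverted[str(best_index)]


# Assumed legend shape, consistent with the prediction output shown above
example_legend = {"0": "ham", "1": "spam"}
example_row = [0.90141803, 0.09858193]

print(label_predictions(example_row, example_legend))
print(top_label(example_row, example_legend))
```

If only a single verdict is needed downstream, something like <code>top_label</code> may be more convenient than the full list of probabilities.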

## Exporting Tokenizer & Metadata

In [28]:
metadata = {
    "labels_legend_inverted": labels_legend_inverted,
    "legend": legend,
    "max_sequence": max_sequence,
    "max_words": max_words,
}

METADATA_EXPORT_PATH = EXPORT_DIR / 'spam-classifer-metadata.json'
METADATA_EXPORT_PATH.write_text(json.dumps(metadata, indent=4))

187
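Reading the metadata back is a matter of parsing the JSON file. One detail worth knowing: JSON object keys are always strings, so the <code>str(i)</code> lookup used in <code>predict</code> keeps working after a round trip. The sketch below uses illustrative values (the legend contents are assumed, and the two size values are inferred from the model summary), and <code>load_metadata</code> is a hypothetical helper.

```python
import json
import pathlib
import tempfile


def load_metadata(path):
    """Read a metadata JSON file back into a dictionary."""
    return json.loads(pathlib.Path(path).read_text())


# Illustrative values only; the real ones come from the training data
example_metadata = {
    "labels_legend_inverted": {"0": "ham", "1": "spam"},
    "legend": {"ham": 0, "spam": 1},
    "max_sequence": 300,
    "max_words": 280,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir) / "metadata.json"
    path.write_text(json.dumps(example_metadata, indent=4))
    restored = load_metadata(path)

# Integer keys do not survive JSON: they come back as strings
round_tripped = json.loads(json.dumps({0: "ham"}))
print(restored["labels_legend_inverted"]["0"], round_tripped)
```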

In [29]:
tokenizer_as_json = tokenizer.to_json()

TOKENIZER_EXPORT_PATH = EXPORT_DIR / 'spam-classifer-tokenizer.json'
TOKENIZER_EXPORT_PATH.write_text(tokenizer_as_json)

828992

We can load <code>tokenizer_as_json</code> with <code>tensorflow.keras.preprocessing.text.tokenizer_from_json</code>.

## Upload Model, Tokenizer, & Metadata to Object Storage


Object Storage options include:
- AWS S3
- Linode Object Storage
- DigitalOcean Spaces

All three of these options can use <code>boto3</code>.

In [43]:
# AWS S3 Config
ACCESS_KEY = ""
SECRET_KEY = ""

# No need to set in AWS
ENDPOINT = None

# Your s3-bucket region
REGION = ""

BUCKET_NAME = ""
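For Linode Object Storage or DigitalOcean Spaces, <code>ENDPOINT</code> has to point at the provider's S3-compatible URL instead of <code>None</code>. The sketch below shows one way to derive it. The URL patterns are assumptions based on the providers' usual naming and should be checked against their current documentation, and <code>endpoint_for</code> is a hypothetical helper.

```python
def endpoint_for(provider, region):
    """Return an endpoint_url value for boto3, or None to use AWS defaults."""
    # Assumed URL patterns; verify against each provider's documentation
    templates = {
        "aws": None,
        "linode": "https://{region}.linodeobjects.com",
        "digitalocean": "https://{region}.digitaloceanspaces.com",
    }
    if provider not in templates:
        raise ValueError(f"Unknown provider: {provider}")
    template = templates[provider]
    return None if template is None else template.format(region=region)


print(endpoint_for("aws", "us-east-1"))
print(endpoint_for("digitalocean", "nyc3"))
```

The result can be assigned to <code>ENDPOINT</code>, which is then passed as <code>endpoint_url</code> when the client is created.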

## Perform Upload with Boto3

In [44]:
os.environ["AWS_ACCESS_KEY_ID"] = ACCESS_KEY
os.environ["AWS_SECRET_ACCESS_KEY"] = SECRET_KEY

In [45]:
MODEL_KEY_NAME = f"exports/spam-sms/{MODEL_EXPORT_PATH.name}"
TOKENIZER_KEY_NAME = f"exports/spam-sms/{TOKENIZER_EXPORT_PATH.name}"
METADATA_KEY_NAME = f"exports/spam-sms/{METADATA_EXPORT_PATH.name}"

In [46]:
session = boto3.session.Session()
client = session.client("s3", region_name=REGION, endpoint_url=ENDPOINT)
client.upload_file(str(MODEL_EXPORT_PATH), BUCKET_NAME, MODEL_KEY_NAME)
client.upload_file(str(TOKENIZER_EXPORT_PATH), BUCKET_NAME, TOKENIZER_KEY_NAME)
client.upload_file(str(METADATA_EXPORT_PATH), BUCKET_NAME, METADATA_KEY_NAME)

# download_file takes (Bucket, Key, Filename), the reverse of upload_file's (Filename, Bucket, Key)
client.download_file(BUCKET_NAME, MODEL_KEY_NAME, str(MODEL_EXPORT_PATH))
client.download_file(BUCKET_NAME, TOKENIZER_KEY_NAME, str(TOKENIZER_EXPORT_PATH))
client.download_file(BUCKET_NAME, METADATA_KEY_NAME, str(METADATA_EXPORT_PATH))

## Implement an AI Model Download Pipeline

In this part, we will turn the <code>client.download_file()</code> portion into a pipeline so we can reuse it in future projects. Further, if we ever need to bundle these models into a Docker image, we can use this pipeline to do so.
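As a rough idea of the shape such a pipeline could take, here is a minimal sketch. It assumes a client object exposing <code>download_file(bucket, key, filename)</code>, as the boto3 S3 client does. The <code>ModelDownloader</code> class and its field names are hypothetical, not the implementation this series arrives at.

```python
import pathlib
from dataclasses import dataclass


@dataclass
class ModelDownloader:
    """Hypothetical helper that downloads a set of exported files."""

    client: object            # anything with download_file(bucket, key, filename)
    bucket_name: str
    key_prefix: str           # for example "exports/spam-sms"
    local_dir: pathlib.Path

    def download(self, file_names):
        # Make sure the destination directory exists before downloading
        self.local_dir.mkdir(parents=True, exist_ok=True)
        downloaded = []
        for name in file_names:
            destination = self.local_dir / name
            key = f"{self.key_prefix}/{name}"
            self.client.download_file(self.bucket_name, key, str(destination))
            downloaded.append(destination)
        return downloaded
```

With the boto3 client created earlier, usage might look like <code>ModelDownloader(client, BUCKET_NAME, "exports/spam-sms", EXPORT_DIR).download([MODEL_EXPORT_PATH.name])</code>. Passing the client in, rather than creating it inside the class, keeps the class easy to test with a stand-in object.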