# Build a Spam Classifier with Keras

Deep learning has made spam content much easier to handle. As the classifier sees more training data over time (aided by direct user feedback), it should produce fewer and fewer erroneous results.

This is the first part of a multi-part series covering how to:

- Build an AI Model (this one)
- Integrate a NoSQL Database (inference result storing)
- Deploy an AI Model into Production

### Prerequisites

- Prepare your dataset using [this notebook](https://github.com/codingforentrepreneurs/AI-as-an-API/blob/main/guides/spam-classifier/1%20-%20Prepare%20the%20AI%20Spam%20Classifier%20Dataset.ipynb).
- Convert your dataset into trainable vectors in [this notebook](https://github.com/codingforentrepreneurs/AI-as-an-API/blob/main/guides/spam-classifier/2%20-%20Convert%20Dataset%20into%20Vectors.ipynb). This step is optional, since the current notebook runs it for us either way.

### Running this notebook:

- Recommended: Use [Colab](https://colab.research.google.com/github/codingforentrepreneurs/AI-as-an-API/blob/main/guides/spam-classifier/Spam_Classifier_with_Keras.ipynb), as it offers free GPUs for training models. [Launch this notebook here](https://colab.research.google.com/github/codingforentrepreneurs/AI-as-an-API/blob/main/guides/spam-classifier/Spam_Classifier_with_Keras.ipynb).
- Fork [the AI as an API repo](https://github.com/codingforentrepreneurs/AI-as-an-API) and run `guides/spam-classifier/Spam_Classifier_with_Keras.ipynb` whenever you'd like.

This notebook is brought to you in partnership with [DataStax](https://dtsx.io/3nRWZEG).


In [9]:
!pip install boto3
!pip install -U pandas tensorflow

Collecting pandas
  Downloading pandas-2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.0.3
    Uninstalling pandas-2.0.3:
      Successfully uninstalled pandas-2.0.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.0.3, but you have pandas 2.2.1 which is incompatible.
Successfully installed pandas-2.2.1


In [10]:
import boto3
import os
import pathlib
import pandas as pd
import pickle

In [11]:
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.models import Model, Sequential

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [12]:
EXPORT_DIR = pathlib.Path('/datasets/exports/')
GUIDES_DIR = pathlib.Path("/guides/spam-classifier/")
DATASET_CSV_PATH = EXPORT_DIR / 'spam-dataset.csv'
TRAINING_DATA_PATH = EXPORT_DIR / 'spam-training-data.pkl'
CONVERT_DATA_GUIDE_PATH = GUIDES_DIR / "Convert Dataset into Vectors.ipynb"

## Prepare Dataset

A dataset is rarely created in the same place where training runs. The cells below download the data we need to train against.


In [13]:
!mkdir -p "$EXPORT_DIR"
!mkdir -p "$GUIDES_DIR"
!curl "https://raw.githubusercontent.com/EdwardKWang/machine-learning/main/lstm-spam-api/datasets/exports/spam-dataset.csv" -o "$DATASET_CSV_PATH"
!curl "https://raw.githubusercontent.com/EdwardKWang/machine-learning/main/lstm-spam-api/guides/spam-classifier/Convert%20Dataset%20into%20Vectors.ipynb" -o "$CONVERT_DATA_GUIDE_PATH"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  787k  100  787k    0     0  2199k      0 --:--:-- --:--:-- --:--:-- 2199k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15587  100 15587    0     0  65493      0 --:--:-- --:--:-- --:--:-- 65767


Let's review our extracted dataset, which combines two different spam datasets as outlined in [this notebook](https://github.com/codingforentrepreneurs/AI-as-an-API/blob/main/guides/spam-classifier/1%20-%20Prepare%20the%20AI%20Spam%20Classifier%20Dataset.ipynb).


In [14]:
df = pd.read_csv(DATASET_CSV_PATH)
df.head()

| Unnamed: 0 | label | text | source | raw_source |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | sms=spam | |
| 1 | ham | Ok lar... Joking wif u oni... | sms=spam | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | sms=spam | |
| 3 | ham | U dun say so early hor... U c already then say... | sms=spam | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | sms=spam | |


In [this notebook](https://github.com/codingforentrepreneurs/AI-as-an-API/blob/main/guides/spam-classifier/2%20-%20Convert%20Dataset%20into%20Vectors.ipynb) we convert our dataset (`spam-dataset.csv`) into a form that is ready for model training. The command below runs that notebook.


In [15]:
%run "$CONVERT_DATA_GUIDE_PATH"

BASE_DIR is /
Random Index 4128
Found 12077 unique tokens.


Load the prepared training data that the notebook run above exported.


In [16]:
data = {}

with open(TRAINING_DATA_PATH, 'rb') as f:
    data = pickle.load(f)

> The code above uses `pickle` to load data. Unpickling files downloaded from third parties can be unsafe, because loading them can execute arbitrary code. Here the pickle file is generated locally by the `%run` step a few cells earlier and is never downloaded, so it is safe to load.
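
Because the next cell indexes `data` by a fixed set of keys, a quick check right after loading can catch a stale or incomplete pickle file early. The sketch below is one possible way to do that. `validate_training_data` is a hypothetical helper (it is not part of the original notebook), and the sample dictionary only stands in for the real `data`.

```python
# Keys the notebook reads from the unpickled training data.
REQUIRED_KEYS = {
    "X_test", "X_train", "y_test", "y_train",
    "labels_legend_inverted", "legend",
    "max_sequence", "max_words", "tokenizer",
}

def validate_training_data(data):
    """Hypothetical helper: raise if any expected key is missing."""
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise KeyError(f"Training data is missing keys: {sorted(missing)}")
    return data

# Stand-in for the real `data`, only to show the failure mode.
incomplete = {"X_train": [], "y_train": []}
try:
    validate_training_data(incomplete)
except KeyError as err:
    print(err)
```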


## Transform Extracted Dataset


In [17]:
X_test = data['X_test']
X_train = data['X_train']
y_test = data['y_test']
y_train = data['y_train']
labels_legend_inverted = data['labels_legend_inverted']
legend = data['legend']
max_sequence = data['max_sequence']
max_words = data['max_words']
tokenizer = data['tokenizer']

## Create our LSTM Model


In [18]:
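embed_dim = 128  # size of each word embedding vector
lstm_out = 196   # number of LSTM units

model = Sequential()
# `max_words` was loaded from the prepared training data above
model.add(Embedding(max_words, embed_dim, input_shape=(X_train.shape[1],)))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.3, recurrent_dropout=0.3))
# Two output units: one probability per label (ham, spam)
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
# summary() prints the table itself and returns None, so no print() is needed
model.summary()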
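
As a sanity check on the model's size, the trainable parameter counts can be worked out by hand from the layer settings above. This is a rough sketch using the standard parameter-count formulas for embedding, LSTM, and dense layers. The function names are illustrative, and the embedding count is left as a function because it depends on `max_words`, which comes from the prepared data.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

embed_dim = 128
lstm_out = 196

print(lstm_params(embed_dim, lstm_out))  # 254800
print(dense_params(lstm_out, 2))         # 394
```

`SpatialDropout1D` adds no trainable parameters, so it does not appear in this tally.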


In [20]:
batch_size = 32
epochs = 5
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=batch_size, verbose=1, epochs=epochs)

Epoch 1/5
158/158 ━━━━━━━━━━━━━━━━━━━━ 86s 512ms/step - accuracy: 0.8090 - loss: 0.4344 - val_accuracy: 0.9501 - val_loss: 0.1597
Epoch 2/5
158/158 ━━━━━━━━━━━━━━━━━━━━ 139s 512ms/step - accuracy: 0.9575 - loss: 0.1387 - val_accuracy: 0.9590 - val_loss: 0.1373
Epoch 3/5
158/158 ━━━━━━━━━━━━━━━━━━━━ 82s 510ms/step - accuracy: 0.9578 - loss: 0.1325 - val_accuracy: 0.9577 - val_loss: 0.1391
Epoch 4/5
158/158 ━━━━━━━━━━━━━━━━━━━━ 81s 505ms/step - accuracy: 0.9654 - loss: 0.1132 - val_accuracy: 0.9557 - val_loss: 0.1335
Epoch 5/5
158/158 ━━━━━━━━━━━━━━━━━━━━ 83s 509ms/step - accuracy: 0.9594 - loss: 0.1241 - val_accuracy: 0.9573 - val_loss: 0.1387


<keras.src.callbacks.history.History at 0x7a75981767a0>
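
Validation loss in the log above does not keep falling: it is lowest at epoch 4 and ticks up again at epoch 5. Below is a small sketch for finding that epoch. The values are copied from the log into a plain dictionary shaped like Keras' `History.history`, and the variable names are illustrative.

```python
# val_loss / val_accuracy per epoch, copied from the training log above.
history = {
    "val_loss": [0.1597, 0.1373, 0.1391, 0.1335, 0.1387],
    "val_accuracy": [0.9501, 0.9590, 0.9577, 0.9557, 0.9573],
}

# Epochs are 1-indexed in the log; list positions are 0-indexed.
val_losses = history["val_loss"]
best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
best_epoch = best_index + 1

print(best_epoch, val_losses[best_index])  # 4 0.1335
```

In a live session, the same dictionary is available as the `.history` attribute of the object returned by `model.fit`.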

In [21]:
MODEL_EXPORT_PATH = EXPORT_DIR / 'spam-model.h5'
model.save(str(MODEL_EXPORT_PATH))



## Predict new data


In [22]:
import numpy as np

def predict(text_str, max_words=280, max_sequence=280, tokenizer=None):
  # `max_words` is unused here; it is kept so existing calls keep working
  if tokenizer is None:
    return None
  sequences = tokenizer.texts_to_sequences([text_str])
  x_input = pad_sequences(sequences, maxlen=max_sequence)
  y_output = model.predict(x_input)
  # y_output has shape (1, num_labels); take the scores for our single input.
  # (Indexing by np.argmax would fail whenever the top label is not index 0.)
  preds = y_output[0]
  labeled_preds = [{f"{labels_legend_inverted[str(i)]}": x} for i, x in enumerate(preds)]
  return labeled_preds

In [23]:
predict("hello world", max_words=max_words, max_sequence=max_sequence, tokenizer=tokenizer)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 449ms/step


[{'ham': 0.9872537}, {'spam': 0.012746287}]
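
`predict` returns one single-key dictionary per label. When only the winning label is needed, the list can be collapsed as in this sketch. `top_prediction` is a hypothetical helper, and the sample input is the output shown above.

```python
def top_prediction(labeled_preds):
    """Hypothetical helper: return (label, score) for the highest score."""
    # Merge the single-key dicts into one {label: score} mapping.
    scores = {
        label: float(score)
        for item in labeled_preds
        for label, score in item.items()
    }
    label = max(scores, key=scores.get)
    return label, scores[label]

sample = [{'ham': 0.9872537}, {'spam': 0.012746287}]
print(top_prediction(sample))  # ('ham', 0.9872537)
```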

## Exporting Tokenizer & Metadata


In [24]:
import json
metadata = {
    "labels_legend_inverted": labels_legend_inverted,
    "legend": legend,
    "max_sequence": max_sequence,
    "max_words": max_words,
}

METADATA_EXPORT_PATH = EXPORT_DIR / 'spam-classifier-metadata.json'
METADATA_EXPORT_PATH.write_text(json.dumps(metadata, indent=4))

187
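
Reading the metadata back is the mirror image of the cell above. One detail worth knowing is that JSON object keys are always strings, which fits how `predict` looks labels up with `str(i)`. The sketch below round-trips a stand-in metadata dictionary through a temporary file. The values are placeholders, not the notebook's real numbers.

```python
import json
import pathlib
import tempfile

# Placeholder values; the real ones come from the prepared training data.
metadata = {
    "labels_legend_inverted": {0: "ham", 1: "spam"},
    "max_sequence": 280,
    "max_words": 280,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = pathlib.Path(tmp_dir) / "spam-classifier-metadata.json"
    path.write_text(json.dumps(metadata, indent=4))
    loaded = json.loads(path.read_text())

# Integer keys come back as strings after the JSON round trip.
print(loaded["labels_legend_inverted"])  # {'0': 'ham', '1': 'spam'}
```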

In [25]:
tokenizer_as_json = tokenizer.to_json()

TOKENIZER_EXPORT_PATH = EXPORT_DIR / 'spam-classifier-tokenizer.json'
TOKENIZER_EXPORT_PATH.write_text(tokenizer_as_json)

1090335

Later, we can rebuild the tokenizer from this JSON with `tensorflow.keras.preprocessing.text.tokenizer_from_json`.


## Upload Model, Tokenizer, & Metadata to Object Storage

Object Storage options include:

- AWS S3
- Linode Object Storage
- DigitalOcean Spaces

All three of these options can use `boto3`.


In [None]:
# AWS S3 Config
ACCESS_KEY = "<your_aws_iam_key_id>"
SECRET_KEY = "<your_aws_iam_secret_key>"

# You should not have to set this
ENDPOINT = None

# Your s3-bucket region
REGION = 'us-west-1'

BUCKET_NAME = '<your_s3_bucket_name>'

#### Linode Object Storage Config


In [26]:
ACCESS_KEY = "<your_linode_access_key>"
SECRET_KEY = "<your_linode_secret_key>"

# Object Storage Endpoint URL
ENDPOINT = "https://lstm-spam-api.us-east-1.linodeobjects.com"

# Object Storage Endpoint Region (also in your endpoint url)
REGION = 'us-east-1'

# Set this to a valid slug (without a "/" )
BUCKET_NAME = 'datasets'

#### DigitalOcean Spaces Config


In [None]:
ACCESS_KEY = "<your_do_spaces_access_key>"
SECRET_KEY = "<your_do_spaces_secret_key>"

# Space Endpoint URL
ENDPOINT = "https://ai-cfe-1.nyc3.digitaloceanspaces.com"

# Space Region (also in your endpoint url)
REGION = 'nyc3'

# Set this to a valid slug (without a "/" )
BUCKET_NAME = 'datasets'

## Perform Upload with Boto3


In [27]:
os.environ["AWS_ACCESS_KEY_ID"] = ACCESS_KEY
os.environ["AWS_SECRET_ACCESS_KEY"] = SECRET_KEY

In [28]:
# Upload paths
MODEL_KEY_NAME = f"exports/spam-sms/{MODEL_EXPORT_PATH.name}"
TOKENIZER_KEY_NAME = f"exports/spam-sms/{TOKENIZER_EXPORT_PATH.name}"
METADATA_KEY_NAME = f"exports/spam-sms/{METADATA_EXPORT_PATH.name}"

In [29]:
session = boto3.session.Session()
client = session.client('s3', region_name=REGION, endpoint_url=ENDPOINT)
client.upload_file(str(MODEL_EXPORT_PATH), BUCKET_NAME,  MODEL_KEY_NAME)
client.upload_file(str(TOKENIZER_EXPORT_PATH), BUCKET_NAME,  TOKENIZER_KEY_NAME)
client.upload_file(str(METADATA_EXPORT_PATH), BUCKET_NAME,  METADATA_KEY_NAME)

In [30]:
client.download_file(BUCKET_NAME, MODEL_KEY_NAME, pathlib.Path(MODEL_KEY_NAME).name)
client.download_file(BUCKET_NAME, TOKENIZER_KEY_NAME, pathlib.Path(TOKENIZER_KEY_NAME).name)
client.download_file(BUCKET_NAME, METADATA_KEY_NAME, pathlib.Path(METADATA_KEY_NAME).name)

## Implement an AI Model Download Pipeline

In [this blog post](https://www.codingforentrepreneurs.com/blog/ai-model-download-pipeline) I'll show you how to turn the `client.download_file()` portion into a pipeline so you can make it reusable in future projects. Further, if you ever need to bundle these models into a Docker image, you will be able to use the pipeline.
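
As a rough idea of what such a reusable step might look like (an illustrative sketch, not the pipeline from the blog post), the three `download_file` calls can be wrapped in one function. `download_exports` and `FakeClient` are hypothetical names. The fake client only mimics the `download_file(bucket, key, filename)` call used above, so the sketch runs without credentials.

```python
import pathlib
import tempfile

def download_exports(client, bucket_name, key_names, dest_dir):
    """Hypothetical helper: download each key into dest_dir, return local paths."""
    dest_dir = pathlib.Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    local_paths = []
    for key_name in key_names:
        # Keep only the file name, as the download cell above does.
        local_path = dest_dir / pathlib.Path(key_name).name
        client.download_file(bucket_name, key_name, str(local_path))
        local_paths.append(local_path)
    return local_paths

class FakeClient:
    """Stand-in for a boto3 S3 client; writes an empty file instead of downloading."""
    def download_file(self, bucket, key, filename):
        pathlib.Path(filename).write_bytes(b"")

keys = [
    "exports/spam-sms/spam-model.h5",
    "exports/spam-sms/spam-classifier-tokenizer.json",
    "exports/spam-sms/spam-classifier-metadata.json",
]
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = download_exports(FakeClient(), "datasets", keys, tmp_dir)
    print([p.name for p in paths])
```

With real storage, pass the `client` created earlier in place of `FakeClient()`.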
