### Classify structured data using Keras Preprocessing Layers

In this notebook we use TensorFlow and Keras to classify whether a pet will be adopted or not. This notebook is based on [this](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers) TensorFlow tutorial.


### Imports

In [25]:
!pip install prettytable

Collecting prettytable
  Downloading prettytable-2.2.0-py3-none-any.whl (23 kB)
Installing collected packages: prettytable
Successfully installed prettytable-2.2.0


In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras.layers.experimental import preprocessing

tf.__version__

'2.5.0'

A function that will help us to tabulate data.

In [26]:
from prettytable import PrettyTable

def tabulate(column_names, data, title="VISUALIZING SETS EXAMPLES"):
    table = PrettyTable(column_names)
    table.title = title
    for row in data:
        table.add_row(row)
    print(table)

### Data
The following code cell downloads the data that we will be working with in this notebook. The dataset can be found [here](https://www.kaggle.com/c/petfinder-adoption-prediction). **(PetFinder)**

In [2]:
import pathlib
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')

dataframe = pd.read_csv(csv_file)

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip


In [6]:
import os

os.remove("datasets/petfinder_mini.zip")
print("Done.")

Done.


### Visualizing the data

In [7]:
dataframe.head()

| | Type | Age | Breed1 | Gender | Color1 | Color2 | MaturitySize | FurLength | Vaccinated | Sterilized | Health | Fee | Description | PhotoAmt | AdoptionSpeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cat | 3 | Tabby | Male | Black | White | Small | Short | No | No | Healthy | 100 | Nibble is a 3+ month old ball of cuteness. He ... | 1 | 2 |
| 1 | Cat | 1 | Domestic Medium Hair | Male | Black | Brown | Medium | Medium | Not Sure | Not Sure | Healthy | 0 | I just found it alone yesterday near my apartm... | 2 | 0 |
| 2 | Dog | 1 | Mixed Breed | Male | Brown | White | Medium | Medium | Yes | No | Healthy | 0 | Their pregnant mother was dumped by her irresp... | 7 | 3 |
| 3 | Dog | 4 | Mixed Breed | Female | Black | Brown | Medium | Short | Yes | No | Healthy | 150 | Good guard dog, very alert, active, obedience ... | 8 | 2 |
| 4 | Dog | 1 | Mixed Breed | Male | Black | No Color | Medium | Short | No | No | Healthy | 0 | This handsome yet cute boy is up for adoption.... | 3 | 2 |


In [8]:
dataframe.describe()

| | Age | Fee | PhotoAmt | AdoptionSpeed |
|---|---|---|---|---|
| count | 11537.0 | 11537.0 | 11537.0 | 11537.0 |
| mean | 11.743434 | 23.957268 | 3.610211 | 2.486522 |
| std | 19.324221 | 80.024226 | 3.145872 | 1.173275 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2.0 | 0.0 | 2.0 | 2.0 |
| 50% | 4.0 | 0.0 | 3.0 | 2.0 |
| 75% | 12.0 | 0.0 | 5.0 | 4.0 |
| max | 255.0 | 2000.0 | 30.0 | 4.0 |


### Creating the target variable
The original task on Kaggle was to predict the speed at which a pet will be adopted. In this notebook we simplify it into a binary classification problem and predict whether the pet will be adopted or not. Our labels are as follows:

1. `1` -> the pet will be adopted
2. `0` -> the pet will not be adopted

In the original dataset `4` indicated that the pet was not adopted.
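
As a small illustration of this mapping, the sketch below applies the same rule to a handful of made-up `AdoptionSpeed` values. The array is hypothetical and not taken from the dataset.

```python
import numpy as np

# Hypothetical AdoptionSpeed values; 4 means the pet was not adopted.
adoption_speed = np.array([2, 0, 3, 4, 4, 1])

# Binary target: 0 where AdoptionSpeed == 4, otherwise 1.
target = np.where(adoption_speed == 4, 0, 1)

# Share of adopted pets in this toy sample.
adopted_share = target.mean()
print(target, adopted_share)
```

The mean of a 0/1 target is the fraction of positive examples, which is a quick way to check how balanced the classes are.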

In [9]:
dataframe.columns

Index(['Type', 'Age', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize',
       'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Fee', 'Description',
       'PhotoAmt', 'AdoptionSpeed'],
      dtype='object')

In [11]:
dataframe.iloc[0]

Type                                                           Cat
Age                                                              3
Breed1                                                       Tabby
Gender                                                        Male
Color1                                                       Black
Color2                                                       White
MaturitySize                                                 Small
FurLength                                                    Short
Vaccinated                                                      No
Sterilized                                                      No
Health                                                     Healthy
Fee                                                            100
Description      Nibble is a 3+ month old ball of cuteness. He ...
PhotoAmt                                                         1
AdoptionSpeed                                                    2

In [12]:
dataframe["Type"].unique()

array(['Cat', 'Dog'], dtype=object)

In [16]:
dataframe["target"] = np.where(dataframe["AdoptionSpeed"] == 4, 0, 1)
dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])

In [17]:
dataframe.columns

Index(['Type', 'Age', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize',
       'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Fee', 'PhotoAmt',
       'target'],
      dtype='object')

In [23]:
dataframe.target.describe()

count    11537.000000
mean         0.733033
std          0.442394
min          0.000000
25%          0.000000
50%          1.000000
75%          1.000000
max          1.000000
Name: target, dtype: float64

### Splitting the sets

Next we are going to split the data into three sets: `train`, `validation` and `test`.
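
A three-way split is usually done in two stages: first hold out the test rows, then carve the validation rows out of what remains, so that the three subsets never overlap. The sketch below shows the idea on plain row indices. The function `three_way_split` and its parameters are hypothetical, and its rounding is a simplification that may differ by a row from `train_test_split`.

```python
import random

def three_way_split(n_rows, test_frac=0.2, val_frac=0.2, seed=24):
    """Return disjoint (train, val, test) lists of row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # First hold out the test rows from the full set.
    n_test = round(n_rows * test_frac)
    test, rest = indices[:n_test], indices[n_test:]

    # Then hold out validation rows from what remains, not from the full set.
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```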

In [24]:
# Hold out 20% of the rows for testing, then 20% of the remaining rows for validation.
train, test = train_test_split(dataframe, test_size=.2, random_state=24)
train, val = train_test_split(train, test_size=.2, random_state=24)

In [27]:
column_names = ["SUBSET", "EXAMPLE(s)"]
row_data = [
        ["training", len(train)],
        ['validation', len(val)],
        ['test', len(test)]
]
tabulate(column_names, row_data)

+-----------------------------+
|  VISUALIZING SETS EXAMPLES  |
+--------------+--------------+
|    SUBSET    |  EXAMPLE(s)  |
+--------------+--------------+
|   training   |     7383     |
|  validation  |     1846     |
|     test     |     2308     |
+--------------+--------------+


### Creating an input pipeline
We are going to use `tf.data`, which lets us batch the data, shuffle it, and so on.
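
To make the two terms concrete, the sketch below mimics shuffling and batching on a plain Python sequence. It is not `tf.data` itself, only an illustration of the semantics, and the helper `shuffled_batches` is hypothetical.

```python
import random

def shuffled_batches(rows, batch_size, seed=0):
    """Yield lists of at most `batch_size` rows in shuffled order."""
    rows = list(rows)  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

batches = list(shuffled_batches(range(10), batch_size=4))
print([len(b) for b in batches])  # the last batch may be smaller
```

In `tf.data`, shuffling works through a buffer. When the buffer is as large as the dataset, as in `ds.shuffle(len(dataframe))`, the result is a full shuffle like the one sketched here.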

In [29]:
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('target')
    ds = tf.data.Dataset.from_tensor_slices((
        dict(dataframe), labels
    ))
    # Shuffle only when requested, so validation and test data keep their order.
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    return ds.batch(batch_size).prefetch(batch_size)

In [30]:
BATCH_SIZE = 16
train_ds = df_to_dataset(train, batch_size=BATCH_SIZE)

In [31]:
[(train_features, label_batch)] = train_ds.take(1)
print('Every feature:', list(train_features.keys()))
print('A batch of ages:', train_features['Age'])
print('A batch of targets:', label_batch)

Every feature: ['Type', 'Age', 'Breed1', 'Gender', 'Color1', 'Color2', 'MaturitySize', 'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Fee', 'PhotoAmt']
A batch of ages: tf.Tensor([ 2  1  2  2  1  2  2  9 20  2  9  5 72  2 48  3], shape=(16,), dtype=int64)
A batch of targets: tf.Tensor([0 1 1 1 1 1 1 0 0 1 0 1 0 1 1 1], shape=(16,), dtype=int32)


### The preprocessing layers
These Keras layers allow us to preprocess the data in our input pipeline. We are going to use the following preprocessing layers:

1. `Normalization` - Feature-wise normalization of the data.

2. `CategoryEncoding` - Category encoding layer.

3. `StringLookup` - Maps strings from a vocabulary to integer indices.

4. `IntegerLookup` - Maps integers from a vocabulary to integer indices.
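
The idea behind the two lookup layers can be sketched in plain Python: build a vocabulary from the values seen so far, then map each value to its index, falling back to a reserved index for anything unseen. The helpers `build_vocabulary` and `lookup` are hypothetical, and the actual Keras layers may order the vocabulary and reserve indices differently.

```python
def build_vocabulary(values, oov_token="[UNK]"):
    """Map each distinct value to an integer index; index 0 is kept for unseen values."""
    vocab = {oov_token: 0}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab

def lookup(vocab, value, oov_token="[UNK]"):
    """Return the index of `value`, or the out-of-vocabulary index if unknown."""
    return vocab.get(value, vocab[oov_token])

vocab = build_vocabulary(["Cat", "Dog", "Cat"])
indices = [lookup(vocab, v) for v in ["Dog", "Cat", "Rabbit"]]
print(vocab, indices)
```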

In [32]:
# get_normalization_layer function returns a layer which applies featurewise normalization to numerical features

def get_normalization_layer(name, dataset):
    # Create a Normalization layer for our feature.
    normalizer = preprocessing.Normalization(axis=None)
    # Prepare a Dataset that only yields our feature.
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the statistics of the data.
    normalizer.adapt(feature_ds)
    return normalizer

In [33]:
photo_count_col = train_features['PhotoAmt']
layer = get_normalization_layer('PhotoAmt', train_ds)
layer(photo_count_col)

<tf.Tensor: shape=(16, 1), dtype=float32, numpy=
array([[-0.83305043],
       [ 1.0698183 ],
       [-0.83305043],
       [-0.5159057 ],
       [-0.83305043],
       [ 1.0698183 ],
       [-0.5159057 ],
       [ 1.0698183 ],
       [-0.5159057 ],
       [-0.19876088],
       [-0.5159057 ],
       [-0.19876088],
       [-0.5159057 ],
       [-0.5159057 ],
       [ 1.7041079 ],
       [-0.19876088]], dtype=float32)>
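
The values above are standardized: roughly speaking, the layer subtracts the mean learned during `adapt` and divides by the learned standard deviation. A minimal NumPy sketch of that arithmetic follows. The photo counts are made up, and the sketch ignores any small numerical safeguards the real layer may apply.

```python
import numpy as np

# Hypothetical photo counts standing in for one numeric column.
photo_amt = np.array([1.0, 2.0, 7.0, 8.0, 3.0])

# "adapt" step: learn the statistics once from the data.
mean = photo_amt.mean()
std = photo_amt.std()  # population standard deviation

# "call" step: apply the learned statistics to a batch.
normalized = (photo_amt - mean) / std
print(normalized.round(3))
```

After this transformation the column has mean 0 and standard deviation 1, which keeps features such as `Fee` and `PhotoAmt` on comparable scales.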

### Categorical columns
The ``get_category_encoding_layer`` function returns a layer which maps values from a vocabulary to integer indices and one-hot encodes the features.


In [34]:
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a lookup layer which will turn strings (or integers) into integer indices
    if dtype == 'string':
        index = preprocessing.StringLookup(max_tokens=max_tokens)
    else:
        index = preprocessing.IntegerLookup(max_tokens=max_tokens)

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])

    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)

    # Create a CategoryEncoding layer for our integer indices.
    encoder = preprocessing.CategoryEncoding(num_tokens=index.vocabulary_size())

    # Apply one-hot encoding to our indices. The lambda function captures the
    # layers so we can use them, or include them in the functional model later.
    return lambda feature: encoder(index(feature))

In [35]:
type_col = train_features['Type']
layer = get_category_encoding_layer('Type', train_ds, 'string')
layer(type_col)

<tf.Tensor: shape=(16, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]], dtype=float32)>
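
The encoded `Type` batch above has four columns even though the column only holds `Cat` and `Dog`. One reading, assumed here, is that the lookup layer reserves the first two indices (for example for masking and out-of-vocabulary values) and places the real categories after them. The sketch below reproduces that layout in plain Python. The reserved slot names are placeholders, and which of the two types occupies column 2 is not visible in the output, so the order is arbitrary.

```python
def one_hot(index, num_tokens):
    """Return a list of length `num_tokens` with a single 1.0 at `index`."""
    if not 0 <= index < num_tokens:
        raise ValueError(f"index {index} outside [0, {num_tokens})")
    row = [0.0] * num_tokens
    row[index] = 1.0
    return row

# Assumed layout: two reserved slots (0 and 1) followed by the two pet types.
vocabulary = ["<reserved-0>", "<reserved-1>", "Cat", "Dog"]
encoded = [one_hot(vocabulary.index(t), len(vocabulary)) for t in ["Cat", "Dog"]]
print(encoded)
```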

### Creating datasets

In [36]:
BATCH_SIZE = 256

train_ds = df_to_dataset(train, batch_size=BATCH_SIZE)
val_ds = df_to_dataset(val, shuffle=False, batch_size=BATCH_SIZE)
test_ds = df_to_dataset(test, shuffle=False, batch_size=BATCH_SIZE)


In [37]:
all_inputs = []
encoded_features = []

# Numeric features.
for header in ['PhotoAmt', 'Fee']:
    numeric_col = tf.keras.Input(shape=(1,), name=header)
    normalization_layer = get_normalization_layer(header, train_ds)
    encoded_numeric_col = normalization_layer(numeric_col)
    all_inputs.append(numeric_col)
    encoded_features.append(encoded_numeric_col)
    

In [38]:
# Categorical features encoded as integers.
age_col = tf.keras.Input(shape=(1,), name='Age', dtype='int64')
encoding_layer = get_category_encoding_layer('Age', train_ds, dtype='int64',
                                             max_tokens=5)
encoded_age_col = encoding_layer(age_col)
all_inputs.append(age_col)
encoded_features.append(encoded_age_col)

In [39]:
# Categorical features encoded as strings.
categorical_cols = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                    'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1']
for header in categorical_cols:
    categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
    encoding_layer = get_category_encoding_layer(header, train_ds, dtype='string',
                                                 max_tokens=5)
    encoded_categorical_col = encoding_layer(categorical_col)
    all_inputs.append(categorical_col)
    encoded_features.append(encoded_categorical_col)
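
Passing `max_tokens=5` caps the vocabulary, which matters for columns with many distinct values such as `Breed1`: only the most frequent values keep their own index and the rest share an "unknown" slot. The sketch below illustrates that capping idea with `collections.Counter`. The helper `top_k_vocabulary` and the breed list are hypothetical, and the sketch does not model how Keras counts its reserved slots toward `max_tokens`.

```python
from collections import Counter

def top_k_vocabulary(values, k):
    """Keep the k most frequent values; everything else is treated as unknown."""
    counts = Counter(values)
    return [value for value, _ in counts.most_common(k)]

breeds = ["Mixed Breed", "Tabby", "Mixed Breed", "Poodle", "Mixed Breed", "Tabby"]
kept = top_k_vocabulary(breeds, k=2)

# Index 0 is the shared "unknown" slot; kept values start at 1.
indices = [kept.index(b) + 1 if b in kept else 0 for b in breeds]
print(kept, indices)
```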

### Creating a model

In [40]:
all_features = tf.keras.layers.concatenate(encoded_features)
x = tf.keras.layers.Dense(32, activation="relu")(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(all_inputs, output)
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])
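
The final `Dense(1)` layer has no activation, so the model outputs a raw score (a logit) and the loss is told so with `from_logits=True`. As a rough sketch of what that loss computes for a single example, the plain-Python functions below evaluate binary cross-entropy directly from a logit. The function names are hypothetical and this is not the Keras implementation.

```python
import math

def sigmoid(logit):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def bce_from_logit(label, logit):
    """Binary cross-entropy for one example, computed directly from the logit."""
    # Numerically stable form of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# A confident, correct prediction costs little; a confident, wrong one costs a lot.
print(bce_from_logit(1, 3.0), bce_from_logit(0, 3.0))
```

Working from logits avoids taking the logarithm of a probability that has already been rounded to 0 or 1.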

In [41]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Age (InputLayer)                [(None, 1)]          0                                            
__________________________________________________________________________________________________
Type (InputLayer)               [(None, 1)]          0                                            
__________________________________________________________________________________________________
Color1 (InputLayer)             [(None, 1)]          0                                            
__________________________________________________________________________________________________
Color2 (InputLayer)             [(None, 1)]          0                                            
______________________________________________________________________________________________

In [42]:
model.fit(train_ds, epochs=10, validation_data=val_ds)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x171bb3373d0>

In [43]:
loss, accuracy = model.evaluate(test_ds)
print("Accuracy", accuracy)

Accuracy 0.7205372452735901


### Model inference

In [45]:
sample = {
    'Type': 'Cat',
    'Age': 3,
    'Breed1': 'Tabby',
    'Gender': 'Male',
    'Color1': 'Black',
    'Color2': 'White',
    'MaturitySize': 'Small',
    'FurLength': 'Short',
    'Vaccinated': 'No',
    'Sterilized': 'No',
    'Health': 'Healthy',
    'Fee': 100,
    'PhotoAmt': 2,
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)
prob = tf.nn.sigmoid(predictions[0])

print(
    "This particular pet had a %.1f percent probability "
    "of getting adopted." % (100 * prob)
)

This particular pet had a 80.5 percent probability of getting adopted.


In [46]:
sample = {
    'Type': 'Dog',
    'Age': 1,
    'Breed1': 'Tabby',
    'Gender': 'Male',
    'Color1': 'Black',
    'Color2': 'White',
    'MaturitySize': 'Small',
    'FurLength': 'Short',
    'Vaccinated': 'No',
    'Sterilized': 'No',
    'Health': 'Healthy',
    'Fee': 100,
    'PhotoAmt': 2,
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)
prob = tf.nn.sigmoid(predictions[0])

print(
    "This particular pet had a %.1f percent probability "
    "of getting adopted." % (100 * prob)
)

This particular pet had a 90.9 percent probability of getting adopted.


In [48]:
sample = {
    'Type': 'Cat',
    'Age': 1,
    'Breed1': 'Tabby',
    'Gender': 'Male',
    'Color1': 'Black',
    'Color2': 'White',
    'MaturitySize': 'Small',
    'FurLength': 'Short',
    'Vaccinated': 'No',
    'Sterilized': 'No',
    'Health': 'Healthy',
    'Fee': 10,
    'PhotoAmt': 2,
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)
prob = tf.nn.sigmoid(predictions[0])

print(
    "This particular pet had a %.1f percent probability "
    "of getting adopted." % (100 * prob)
)

This particular pet had a 90.6 percent probability of getting adopted.
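
Because the model returns a logit, each inference cell applies `tf.nn.sigmoid` before reporting a percentage. The conversion itself is simple enough to sketch in plain Python. The logits below are made up and are not outputs of the model, and `logit_to_percent` is a hypothetical helper.

```python
import math

def logit_to_percent(logit):
    """Convert a raw model output (logit) to a probability in percent."""
    return 100.0 / (1.0 + math.exp(-logit))

# Hypothetical logits, not outputs of the model above.
for logit in (-2.0, 0.0, 2.0):
    print("logit %+.1f -> %.1f percent" % (logit, logit_to_percent(logit)))
```

A logit of 0 corresponds to 50 percent, positive logits lean towards "adopted", and negative logits lean towards "not adopted".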


### Conclusion
We have learnt how to predict pet adoption using Keras preprocessing layers. The same preprocessing can also be done using scikit-learn's [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).