In this exercise, we will implement a deep neural network in <code>keras</code> from start to finish. We start by reading and exploring the data, then apply several preprocessing steps to make it suitable for a deep learning model, and finally build and train a neural network. The dataset contains information about pets, and the goal is to predict whether a pet will find a new home.

In [1]:
import numpy as np
import pandas as pd

1. Reading the data
To start, we need to read the files <code>petfinder_train.csv</code> and <code>petfinder_test.csv</code>. These files are in the data folder.

In [2]:
train = pd.read_csv('./data/petfinder_train.csv')
test = pd.read_csv('./data/petfinder_test.csv')

2. Create a <code>Target</code> column
<br>
The goal is to convert the <code>AdoptionSpeed</code> column into a binary column called <code>Target</code>. According to the description, animals with an <code>AdoptionSpeed</code> between 0 and 3 are adopted (``True``), and those with a value of 4 are not adopted (``False``).

In [3]:
# The comparison already yields a boolean column: True for speeds 0-3, False for 4
train['Target'] = train['AdoptionSpeed'] < 4
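
As a quick sanity check of this rule, the sketch below applies the same comparison to a hypothetical five-row frame. The `toy` frame is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame covering every AdoptionSpeed value (0-4).
toy = pd.DataFrame({'AdoptionSpeed': [0, 1, 2, 3, 4]})

# Speeds 0-3 mean the pet was adopted; 4 means it was not.
toy['Target'] = toy['AdoptionSpeed'] < 4

print(toy['Target'].tolist())
```

Only the last row, with speed 4, ends up as `False`.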

3. Remove unnecessary columns
<br>
We remove the ``AdoptionSpeed`` and ``Description`` columns from the ``train`` dataframe and the ``Description`` column from the ``test`` dataframe.

In [4]:
train.drop(columns=['AdoptionSpeed', 'Description'], inplace=True)
test.drop(columns=['Description'], inplace=True)

4. Encoding ordinal variables
<br>
The columns ``MaturitySize``, ``FurLength``, and ``Health`` are ordinal variables, so we convert them to numeric values using ``LabelEncoder``. For the test dataset, we call only ``transform`` so that the mapping learned from the train dataset is reused.

In [5]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# MaturitySize
train['MaturitySize'] = le.fit_transform(train['MaturitySize'])
test['MaturitySize'] = le.transform(test['MaturitySize'])

# FurLength
train['FurLength'] = le.fit_transform(train['FurLength'])
test['FurLength'] = le.transform(test['FurLength'])

# Health
train['Health'] = le.fit_transform(train['Health'])
test['Health'] = le.transform(test['Health'])
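
One caveat worth checking: `LabelEncoder` assigns codes by the sorted order of the values it sees, so it preserves an ordinal scale only when the raw values already sort in the intended order. The sketch below uses made-up values (the `train_sizes` and `test_sizes` arrays are hypothetical) and also shows that `transform` rejects values that were not seen during `fit`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordinal values; not taken from the dataset.
train_sizes = np.array([3, 1, 2, 1, 4])
test_sizes = np.array([2, 4, 1])

le = LabelEncoder()
train_codes = le.fit_transform(train_sizes)  # learns the sorted classes 1, 2, 3, 4
test_codes = le.transform(test_sizes)        # reuses the mapping learned on train

print(le.classes_)
print(train_codes)
print(test_codes)

# A value that never appeared during fit cannot be transformed.
try:
    le.transform(np.array([5]))
    unseen_ok = True
except ValueError:
    unseen_ok = False
print('unseen value accepted:', unseen_ok)
```

If the test file contained a category missing from the train file, the `transform` calls above would fail in the same way.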

5. Encoding nominal variables
<br>
The ``Type`` and ``Gender`` columns have only two values, so we can use ``LabelEncoder``. For the others (such as ``Breed1``, ``Color1``, ``Color2``, ``Vaccinated``, ``Sterilized``) that have more values, we use ``BinaryEncoder``.

In [6]:
# Type
le = LabelEncoder()
train['Type'] = le.fit_transform(train['Type'])
test['Type'] = le.transform(test['Type'])

# Gender
train['Gender'] = le.fit_transform(train['Gender'])
test['Gender'] = le.transform(test['Gender'])

For the columns with more than two unique values, we use ``BinaryEncoder`` to convert each one into a small set of binary columns.

In [7]:
import category_encoders as ce

# Columns with more than two categories
columns = ['Breed1', 'Color1', 'Color2', 'Vaccinated', 'Sterilized']

binary_encoder = ce.BinaryEncoder(cols=columns)

# Fit on train only, then reuse the same mapping for test
train_binary = binary_encoder.fit_transform(train[columns])
test_binary = binary_encoder.transform(test[columns])

# Replace the original columns with their binary-encoded versions
train = pd.concat([train.drop(columns=columns), train_binary], axis=1)
test = pd.concat([test.drop(columns=columns), test_binary], axis=1)
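
To make the idea behind binary encoding concrete, here is a minimal pure-Python sketch: each category gets an integer index, and that index is written out in binary digits, one column per bit. It illustrates the concept only and is not the `category_encoders` implementation. The helper `binary_encode` and its index assignment (order of first appearance, starting at 1) are assumptions made for the example.

```python
def binary_encode(values):
    """Map each category to the bits of its 1-based index (illustrative only)."""
    # Assign indices in order of first appearance, starting at 1.
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(index) + 1
    # Number of bits needed to represent the largest index.
    n_bits = max(1, len(index).bit_length())
    # Most significant bit first, one list of bits per input value.
    return [[(index[v] >> bit) & 1 for bit in reversed(range(n_bits))]
            for v in values]

# Hypothetical colour values with five distinct categories.
colors = ['black', 'brown', 'white', 'black', 'golden', 'cream']
encoded = binary_encode(colors)
for color, bits in zip(colors, encoded):
    print(color, bits)
```

Five distinct values fit into three columns here, whereas one-hot encoding would need five.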

6. Data Normalization
<br>
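We standardize all feature columns (``Age``, ``Fee``, ``PhotoAmt``, and the encoded binary/ordinal columns) so that their mean is 0 and their standard deviation is 1. The mean and standard deviation are computed from the train dataset only and then applied to both train and test, so that no information from the test set leaks into preprocessing.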

In [8]:
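# Every column except the label is treated as a feature
columns = [col for col in train.columns if col != 'Target']
for column in columns:
    # Statistics come from train only and are reused for test
    mean = train[column].mean()
    std = train[column].std()
    train[column] = (train[column] - mean) / std
    test[column] = (test[column] - mean) / std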
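
The following sketch illustrates the same train-only standardization on a hypothetical `Age` column (the values are made up). Note that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical values; not taken from the dataset.
toy_train = pd.DataFrame({'Age': [2.0, 4.0, 6.0, 8.0]})
toy_test = pd.DataFrame({'Age': [5.0, 11.0]})

# Statistics are computed on the training column only.
mean = toy_train['Age'].mean()
std = toy_train['Age'].std()  # sample standard deviation (ddof=1)

toy_train['Age'] = (toy_train['Age'] - mean) / std
toy_test['Age'] = (toy_test['Age'] - mean) / std

print(toy_train['Age'].mean(), toy_train['Age'].std())
print(toy_test['Age'].tolist())
```

After the transformation the training column has mean 0 and standard deviation 1, while the test column generally does not, because it is scaled with the training statistics.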

In [9]:
test

Unnamed: 0,Type,Age,Gender,MaturitySize,FurLength,Health,Fee,PhotoAmt,Breed1_0,Breed1_1,...,Color1_0,Color1_1,Color1_2,Color2_0,Color2_1,Color2_2,Vaccinated_0,Vaccinated_1,Sterilized_0,Sterilized_1
0,0.866680,-0.501042,-0.879677,-0.226574,0.790516,-0.194365,-0.29796,-0.830874,-0.073746,-0.184801,...,-0.521041,-0.861486,0.761850,-0.520158,0.777326,0.637933,-1.151487,0.386892,-0.730867,0.360165
1,-1.153719,0.015102,-0.879677,-0.226574,-0.845318,4.568489,-0.29796,-0.830874,-0.073746,-0.184801,...,-0.521041,1.160675,-1.312469,1.922312,0.777326,-1.567414,0.868360,-2.584453,1.368109,-2.776244
2,-1.153719,-0.449427,-0.879677,1.629890,0.790516,-0.194365,-0.29796,-0.512200,-0.073746,-0.184801,...,-0.521041,-0.861486,0.761850,-0.520158,-1.286339,0.637933,-1.151487,0.386892,-0.730867,0.360165
3,0.866680,-0.501042,-0.879677,-0.226574,-0.845318,-0.194365,-0.29796,0.443820,-0.073746,-0.184801,...,-0.521041,1.160675,0.761850,-0.520158,0.777326,0.637933,0.868360,0.386892,-0.730867,0.360165
4,0.866680,0.015102,-0.879677,-0.226574,-2.481151,-0.194365,-0.29796,-0.512200,-0.073746,-0.184801,...,1.919052,1.160675,-1.312469,-0.520158,0.777326,0.637933,0.868360,-2.584453,1.368109,-2.776244
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.866680,0.634474,1.136673,-0.226574,-0.845318,-0.194365,-0.29796,-1.149547,-0.073746,-0.184801,...,-0.521041,1.160675,-1.312469,1.922312,0.777326,-1.567414,0.868360,-2.584453,-0.730867,0.360165
996,-1.153719,-0.552656,-0.879677,-0.226574,0.790516,-0.194365,-0.29796,-0.830874,-0.073746,-0.184801,...,-0.521041,1.160675,0.761850,1.922312,-1.286339,-1.567414,-1.151487,0.386892,-0.730867,0.360165
997,0.866680,-0.294584,-0.879677,1.629890,-2.481151,-0.194365,-0.29796,-0.830874,-0.073746,-0.184801,...,-0.521041,-0.861486,0.761850,-0.520158,-1.286339,0.637933,0.868360,0.386892,-0.730867,0.360165
998,-1.153719,-0.139741,-0.879677,1.629890,0.790516,-0.194365,-0.29796,-0.193527,-0.073746,-0.184801,...,1.919052,1.160675,0.761850,-0.520158,-1.286339,0.637933,0.868360,0.386892,1.368109,0.360165


In [10]:
train

Unnamed: 0,Type,Age,Gender,MaturitySize,FurLength,Health,Fee,PhotoAmt,Target,Breed1_0,...,Color1_0,Color1_1,Color1_2,Color2_0,Color2_1,Color2_2,Vaccinated_0,Vaccinated_1,Sterilized_0,Sterilized_1
0,-1.153719,-0.449427,1.136673,1.629890,0.790516,-0.194365,0.949788,-0.830874,True,-0.073746,...,-0.521041,-0.861486,0.761850,-0.520158,-1.286339,0.637933,-1.151487,0.386892,-0.730867,0.360165
1,-1.153719,-0.552656,1.136673,-0.226574,-0.845318,-0.194365,-0.297960,-0.512200,True,-0.073746,...,-0.521041,-0.861486,0.761850,-0.520158,0.777326,-1.567414,0.868360,-2.584453,1.368109,-2.776244
2,0.866680,-0.552656,1.136673,-0.226574,-0.845318,-0.194365,-0.297960,1.081167,True,-0.073746,...,-0.521041,1.160675,-1.312469,-0.520158,-1.286339,0.637933,0.868360,0.386892,-0.730867,0.360165
3,0.866680,-0.397813,-0.879677,-0.226574,0.790516,-0.194365,1.573661,1.399841,True,-0.073746,...,-0.521041,-0.861486,0.761850,-0.520158,0.777326,-1.567414,0.868360,0.386892,-0.730867,0.360165
4,0.866680,-0.552656,1.136673,-0.226574,0.790516,-0.194365,-0.297960,-0.193527,True,-0.073746,...,-0.521041,-0.861486,0.761850,-0.520158,0.777326,0.637933,-1.151487,0.386892,-0.730867,0.360165
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10532,0.866680,-0.552656,-0.879677,-0.226574,-0.845318,-0.194365,-0.297960,0.125147,False,-0.073746,...,1.919052,1.160675,-1.312469,-0.520158,0.777326,0.637933,-1.151487,0.386892,-0.730867,0.360165
10533,0.866680,0.118330,-0.879677,-0.226574,-0.845318,-0.194365,-0.297960,3.949229,True,-0.073746,...,1.919052,1.160675,-1.312469,-0.520158,0.777326,0.637933,-1.151487,0.386892,-0.730867,0.360165
10534,0.866680,-0.397813,-0.879677,-0.226574,0.790516,-0.194365,-0.297960,1.718514,False,-0.073746,...,-0.521041,1.160675,-1.312469,-0.520158,0.777326,0.637933,0.868360,0.386892,1.368109,0.360165
10535,0.866680,-0.449427,-0.879677,-0.226574,0.790516,-0.194365,0.325914,0.125147,True,-0.073746,...,-0.521041,-0.861486,0.761850,-0.520158,0.777326,-1.567414,-1.151487,0.386892,-0.730867,0.360165


7. Split the data into train and validation
<br>
We separate 10% of the train data for validation.

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    train.drop(columns=['Target']), train['Target'], test_size=0.1, random_state=42
)

print('Train examples:', len(X_train), len(y_train))
print('Validation examples:', len(X_valid), len(y_valid))
print('Test examples:', len(test))

Train examples: 9483 9483
Validation examples: 1054 1054
Test examples: 1000
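
The split sizes printed above can be reproduced by hand. With a fractional `test_size`, scikit-learn rounds the validation count up and gives the remainder to the training set. The sketch below takes the total row count implied by the printed numbers as its starting point.

```python
import math

n_rows = 9483 + 1054  # total rows implied by the printed counts
test_size = 0.1

n_valid = math.ceil(test_size * n_rows)  # validation count is rounded up
n_train = n_rows - n_valid               # the rest goes to training

print(n_train, n_valid)
```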


8. Building a neural network
<br>
We build a network with the desired structure: an input layer, three dense layers with 5000, 1000, and 500 neurons and ``relu`` activation, and an output layer with a single neuron and ``sigmoid`` activation.

In [None]:
import keras
from keras import layers

model = keras.Sequential()
model.add(layers.Input(shape=(X_train.shape[1],)))  # one input per feature column
model.add(layers.Dense(5000, activation='relu'))
model.add(layers.Dense(1000, activation='relu'))
model.add(layers.Dense(500, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))    # probability that the pet is adopted

ModuleNotFoundError: No module named 'keras'

9. View Model Summary
<br>
To review the model structure:

In [None]:
model.summary()
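
The parameter counts reported by `model.summary()` can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below assumes a hypothetical 30 input features, since the actual value of `X_train.shape[1]` is not shown here.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

n_features = 30  # hypothetical; the real value is X_train.shape[1]
layer_sizes = [n_features, 5000, 1000, 500, 1]

# One count per dense layer, pairing each layer's input and output size.
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)
print(total)
```

Under this assumption most of the parameters sit in the 5000-to-1000 layer. That is a lot of capacity for roughly 9,500 training rows, so it is worth comparing training and validation loss for signs of overfitting.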

10. Compile the model
<br>
We compile the model with the ``adam`` optimizer, the ``BinaryCrossentropy`` loss function, and the ``accuracy`` metric.

In [None]:
model.compile(optimizer='adam',
              loss=keras.losses.BinaryCrossentropy(from_logits=False),
              metrics=['accuracy'])
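
For reference, binary cross-entropy on probabilities (the `from_logits=False` case) is the mean of `-(y*log(p) + (1-y)*log(1-p))` over the samples. The following is a minimal pure-Python sketch of that formula, not the Keras implementation. The clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up predictions for one positive and one negative example.
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
unsure = binary_cross_entropy([1, 0], [0.5, 0.5])
wrong = binary_cross_entropy([1, 0], [0.1, 0.9])

print(confident, unsure, wrong)
```

The loss grows as the predicted probability moves away from the true label, so confident wrong answers are penalized most heavily.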

11. Model Training
<br>
We train the model with ``batch_size=128`` and ``epochs=10`` and also specify the validation data.

In [None]:
EPOCHS = 10
BATCH_SIZE = 128

history = model.fit(X_train, y_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_data=(X_valid, y_valid))
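
As a rough guide to how much work this is: the training set is processed in batches, so the number of gradient updates per epoch is the training-set size divided by the batch size, rounded up. The sketch below uses the 9,483 training rows printed in step 7.

```python
import math

n_train = 9483  # training rows printed in step 7
batch_size = 128
n_epochs = 10

steps_per_epoch = math.ceil(n_train / batch_size)
last_batch = n_train - (steps_per_epoch - 1) * batch_size  # size of the final, partial batch
total_updates = steps_per_epoch * n_epochs

print(steps_per_epoch, last_batch, total_updates)
```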

12. Prediction for test data
<br>
We predict for the test dataset and store the output as `True/False` in the `submission` dataframe. Because the output of `sigmoid` is a probability between 0 and 1, `predictions` is an array of such values, and we convert it to `True/False` with `(predictions > 0.5)`, that is, with a threshold of 0.5.

In [None]:
predictions = model.predict(test)
submission = pd.DataFrame({'Target': (predictions > 0.5).flatten()})
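
The reason for `.flatten()`: with a single sigmoid output unit, `model.predict` returns an array of shape `(n_samples, 1)`, while a dataframe column needs a 1-D array. The sketch below uses a made-up `fake_predictions` array in place of the model output.

```python
import numpy as np
import pandas as pd

# Hypothetical probabilities shaped like the output of a one-unit sigmoid layer.
fake_predictions = np.array([[0.91], [0.12], [0.50], [0.73]])

# Threshold at 0.5, then collapse the (n, 1) array to shape (n,).
labels = (fake_predictions > 0.5).flatten()
demo_submission = pd.DataFrame({'Target': labels})

print(fake_predictions.shape, labels.shape)
print(demo_submission['Target'].tolist())
```

Because the comparison is strict, a probability of exactly 0.5 is mapped to `False`.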

13. Checking the evaluation criteria
<br>
To check the performance of the model on the validation data, we can calculate the `F1` score:

In [None]:
from sklearn.metrics import f1_score

val_predictions = model.predict(X_valid)
val_predictions_binary = (val_predictions > 0.5).flatten()
f1 = f1_score(y_valid, val_predictions_binary, average='weighted')
print(f'F1 Score on validation set: {f1}')
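
Note that `average='weighted'` averages the per-class F1 scores, weighted by how often each class occurs, which is not the same number as the default binary F1 of the positive class. A small sketch on made-up labels shows the difference.

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 1 = adopted, 0 = not adopted.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]

f1_positive = f1_score(y_true, y_pred)                      # F1 of class 1 only
f1_weighted = f1_score(y_true, y_pred, average='weighted')  # support-weighted mean over both classes

print(f1_positive, f1_weighted)
```

Here the positive class scores 0.75 and the negative class 0.5, so the weighted average is (4 × 0.75 + 2 × 0.5) / 6 ≈ 0.667.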