### Introduction

Data augmentation is widely used in image classification tasks to increase the number of training instances and, at the same time, make the model more stable. In this notebook I will try to apply the same technique to a rectangular (tabular) dataset, in particular the 'German credit risk dataset'. Since geometric transformations such as rotation or translation cannot be applied to tabular data, I am going to implement two different generative models to create new data. Note that we are not actually adding new information to the data: we are just sampling from the latent space of a model trained on it.

### Setup

In [1]:
# deep learning libraries
import tensorflow as tf
from tensorflow import keras
import keras.backend as K

# common imports
import pandas as pd
import numpy as np

# setting random seed
np.random.seed(1234)
tf.random.set_seed(1234)

# Style setup
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=16)
mpl.rc('ytick', labelsize=12)
plt.style.use('fivethirtyeight')
plt.xkcd(False) 

Using TensorFlow backend.


<matplotlib.rc_context at 0x22a7531ef60>

### Importing the data

I am loading the data from a local file, and I am also going to adjust the column names and data types.

In [2]:
df = pd.read_csv(r'C:\Users\Aless\Downloads\datasets_9109_12699_german_credit_data.csv', index_col = 'Unnamed: 0')
df.columns = df.columns.map(lambda x: x.replace(' ', '_'))
df.Job = df.Job.astype('object')
df.shape

(1000, 10)

### Data validation

In [3]:
df.isna().sum()

Age                   0
Sex                   0
Job                   0
Housing               0
Saving_accounts     183
Checking_account    394
Credit_amount         0
Duration              0
Purpose               0
Risk                  0
dtype: int64

In [4]:
df.Saving_accounts.value_counts(normalize = True)

little        0.738066
moderate      0.126071
quite rich    0.077111
rich          0.058752
Name: Saving_accounts, dtype: float64

'little' is by far the most frequent class, so I will use a simple imputer to replace the missing values (np.nan) with it.

In [5]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')
df['Saving_accounts'] = imputer.fit_transform(df['Saving_accounts'].values.reshape(-1, 1))
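
As a quick sanity check of what the `most_frequent` strategy does, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the dataset. It uses plain pandas instead of `SimpleImputer`, which for a single column should give the same result.

```python
import numpy as np
import pandas as pd

# toy column with two missing entries (illustrative values only)
toy = pd.Series(['little', np.nan, 'rich', 'little', np.nan, 'moderate'])

# the most frequent non-missing value plays the role of the imputer's fill value
most_frequent = toy.mode()[0]
toy_filled = toy.fillna(most_frequent)

print(most_frequent)
print(toy_filled.tolist())
```

Note that this kind of imputation pushes the column even further towards its majority class, which is acceptable here only because 'little' already dominates.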

For Checking_account, instead, the classes are more balanced, so I am going to drop all the rows with missing values.

In [6]:
df.Checking_account.value_counts(normalize = True)

little      0.452145
moderate    0.443894
rich        0.103960
Name: Checking_account, dtype: float64

In [7]:
df.dropna(inplace = True)

It seems that there is no strong imbalance between the positive and negative labels.

In [8]:
df.Risk.value_counts(normalize = True)

good    0.580858
bad     0.419142
Name: Risk, dtype: float64

### Preprocessing

In [9]:
df['Sex'] = df.Sex.apply(lambda x: 1 if x == 'male' else 0)
df['Risk'] = df.Risk.apply(lambda x: 1 if x == 'bad' else 0)

from sklearn.preprocessing import MinMaxScaler

num_col = ['Age', 'Credit_amount', 'Duration']
scaler = MinMaxScaler()
df[num_col] = scaler.fit_transform(df[num_col].values)

df = pd.get_dummies(df, columns = ['Job', 'Housing', 'Saving_accounts', 'Checking_account', 'Purpose'])

### Creating sets

In [10]:
from sklearn.model_selection import train_test_split

X = df.drop('Risk', axis = 1).values
y = df.Risk.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

### Choosing the classifier

As base classifier I am going to compare a support vector machine with XGBoost: the one that performs better in cross-validation will be chosen.

In [11]:
import xgboost as xgb

clf = xgb.XGBClassifier(n_estimators = 200, learning_rate = 0.1)

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X_train, y_train, cv = 5, scoring = 'accuracy')
print('Average accuracy on a 5 folds cross validation: {}'.format(np.round(scores.mean(), 4)))

Average accuracy on a 5 folds cross validation: 0.5951


In [12]:
from sklearn.svm import SVC

svc = SVC(kernel = 'poly', gamma = 'scale')
scores_svm = cross_val_score(svc, X_train, y_train, cv = 5, scoring = 'accuracy')
print('Average accuracy on a 5 folds cross validation: {}'.format(np.round(scores_svm.mean(), 4)))

Average accuracy on a 5 folds cross validation: 0.5744


XGBoost achieves slightly better performance than the SVM, so it will be the classifier on which we will apply data augmentation.

### Variational autoencoder

A variational autoencoder samples the inputs of the decoder from the latent space, which allows it to generate new instances at each sampling. In the following cell I am going to implement the architecture of the variational autoencoder, then I will fit it on the data. Note that I am going to train the model on the whole dataset (features and label together), treating the task as an unsupervised problem. Since I want to tune the model to minimize the loss, I am going to create a function that builds and returns the autoencoder.

In [13]:
class Sampling(keras.layers.Layer):
    def call(self, inputs):
        mean, log_var = inputs
        return K.random_normal(tf.shape(log_var)) * K.exp(log_var / 2) + mean

def build_var_ae(codings_size = 15, n_layers = 3, n_neurons = 100, increase = 100, lr = 3e-3, decay = 1e-4, l2 = 0.01):
    # NOTE: `decay` is part of the search space but is not passed to the optimizer below
    inputs = keras.layers.Input(shape = [X_train[0].shape[0] + 1])
    d = {}
    for hidden_layer in range(n_layers):
        if hidden_layer > 0:
            d['hidden_' + str(hidden_layer)] = keras.layers.Dense(n_neurons + increase * (n_layers - 1) - increase * hidden_layer, 
                                                                  activation = 'selu', kernel_initializer = 'lecun_normal',
                                       kernel_regularizer = keras.regularizers.l2(l2))(d['hidden_' + str(hidden_layer - 1)])
        else:
            d['hidden_' + str(hidden_layer)] = keras.layers.Dense(n_neurons * n_layers, activation = 'selu', 
                                  kernel_initializer = 'lecun_normal')(inputs)
    codings_mean = keras.layers.Dense(codings_size)(d['hidden_' + str(hidden_layer)])
    codings_log_var = keras.layers.Dense(codings_size)(d['hidden_' + str(hidden_layer)])
    codings = Sampling()([codings_mean, codings_log_var])
    var_enc = keras.Model(inputs = [inputs], outputs = [codings_mean, codings_log_var, codings])
    decoder_inputs = keras.layers.Input(shape = [codings_size])
    for hidden_layer in range(n_layers):
        if hidden_layer > 0:
            d['hidden_' + str(hidden_layer)] = keras.layers.Dense(n_neurons + increase * hidden_layer, activation = 'selu', 
                                                                  kernel_initializer = 'lecun_normal',
                               kernel_regularizer = keras.regularizers.l2(l2))(d['hidden_' + str(hidden_layer - 1)])
        else:
            d['hidden_' + str(hidden_layer)] = keras.layers.Dense(n_neurons, activation = 'selu', 
                                  kernel_initializer = 'lecun_normal')(decoder_inputs)
    
    y = keras.layers.Dense(X_train[0].shape[0] + 1, activation = 'sigmoid')( d['hidden_' + str(hidden_layer)])
    outputs = keras.layers.Reshape([X_train[0].shape[0] + 1, 1])(y)
    var_dec = keras.Model(inputs = [decoder_inputs], outputs = [outputs])
    _, _, codings = var_enc(inputs)
    reconstruction_data = var_dec(codings)
    var_ae = keras.Model(inputs = [inputs], outputs = [reconstruction_data])
    latent_loss = -0.5 * K.sum(1 + codings_log_var - K.exp(codings_log_var) - K.square(codings_mean), axis = -1)
    var_ae.add_loss(K.mean(latent_loss) / (X_train[0].shape[0] + 1))
    optimizer = keras.optimizers.Adam(learning_rate = lr, beta_1 = 0.9, beta_2 = 0.999)
    var_ae.compile(loss = 'mse', optimizer = optimizer)
    return var_ae

var_ae = build_var_ae()
keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_var_ae)
data = np.concatenate([X, y.reshape(-1, 1)], axis = 1)
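
The two VAE-specific pieces of the cell above are the `Sampling` layer and the latent loss. A minimal NumPy sketch of both, on made-up values of `mean` and `log_var`, may make the formulas easier to check by hand. The helper names `sample_codings` and `latent_loss` are mine and are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_codings(mean, log_var, rng):
    # reparameterization: z = mean + sigma * eps, with sigma = exp(log_var / 2)
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(log_var / 2) * eps

def latent_loss(mean, log_var):
    # KL divergence between N(mean, exp(log_var)) and N(0, 1), summed over the codings
    return -0.5 * np.sum(1 + log_var - np.exp(log_var) - np.square(mean), axis=-1)

# two toy instances with 3 codings each (illustrative values only)
mean = np.array([[0.0, 0.0, 0.0], [1.0, -1.0, 0.5]])
log_var = np.zeros((2, 3))

codings = sample_codings(mean, log_var, rng)
kl = latent_loss(mean, log_var)
print(codings.shape)
print(kl)
```

The latent loss is zero only when the codings distribution is exactly a standard normal, which is why a trained decoder can reasonably be fed with standard normal noise.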

### Hyperparameter tuning

Since the hyperparameter space is extremely wide, I am going to use a randomized search instead of a grid search.
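
For intuition on the `reciprocal` distribution used for `lr` in the search space: it is a log-uniform distribution, so it spreads samples evenly across orders of magnitude. The NumPy sketch below mimics that behaviour by sampling uniformly in log space and compares it with plain uniform sampling. It illustrates the idea only and is not how scipy implements it.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 3e-4, 3e-2  # same bounds as the search space for lr

# uniform in log space -> log-uniform in the original space
log_uniform = np.exp(rng.uniform(np.log(low), np.log(high), size=10_000))
# plain uniform sampling, for comparison
uniform = rng.uniform(low, high, size=10_000)

# fraction of samples that fall in the lowest decade [3e-4, 3e-3)
frac_log = np.mean(log_uniform < 3e-3)
frac_uni = np.mean(uniform < 3e-3)
print(round(frac_log, 2), round(frac_uni, 2))
```

With log-uniform sampling about half of the candidates land in the lower decade, whereas uniform sampling would spend most of the budget on large learning rates.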

In [14]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal

param_distrib = {'n_layers' : [2, 3, 4, 5, 6],
                 'n_neurons' : np.arange(50, 150),
                 'increase' : np.arange(50, 150),
                 'lr' : reciprocal(3e-4, 3e-2),
                 'codings_size' : np.arange(2, 16),
                 'l2' : [0.01, 0.015, 0.02, 0.025],
                 'decay' : [1e-4, 2e-4, 3e-4, 5e-4]}
rnd_search = RandomizedSearchCV(keras_reg, param_distrib, n_iter = 10, cv = 3)
rnd_search.fit(data, data, epochs = 50, batch_size = 32)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50

RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=<tensorflow.python.keras.wrappers.scikit_learn.KerasRegressor object at 0x0000022A108EEEF0>,
                   iid='warn', n_iter=10, n_jobs=None,
                   param_distributions={'codings_size': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]),
                                        'decay': [0.0001, 0.0002, 0.0003,
                                                  0.0005],
                                        'increase': array([ 50,  51,  52,  53,  54,  55,  56,  57,  58,...
        76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,
        89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101,
       102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114,
       115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127,
       128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140,
       141, 142, 143, 144, 145, 146, 147, 14

In [15]:
model = rnd_search.best_estimator_.model
var_dec = model.layers[2]

### Generating new instances

Now that I have a trained decoder, I can feed it with codings sampled from a standard normal distribution. This will produce new instances.

In [16]:
codings_size = rnd_search.best_params_['codings_size']
codings = tf.random.normal(shape = [5000, codings_size])
new_data = var_dec(codings).numpy()
train_set = np.concatenate([X_train, y_train.reshape(-1, 1)], axis = 1)
augmented_data = np.concatenate([train_set, new_data.reshape(-1, X_train[0].shape[0] + 1)], axis = 0)
np.random.shuffle(augmented_data)
# creating new training set
aug_X = augmented_data[:, :-1]
aug_y = augmented_data[:, -1]
aug_y = [round(x) for x in aug_y]

Now I will compute the cross-validation accuracy score on the augmented data.

In [17]:
scores = cross_val_score(clf, aug_X, aug_y, cv = 5, scoring = 'accuracy')
print('Average accuracy on a 5 folds cross validation: {}'.format(np.round(scores.mean(), 4)))

Average accuracy on a 5 folds cross validation: 0.9644


It is not a surprise to see a huge increase in performance in cross-validation. Most of the instances on which we are evaluating the classifier are artificial, so the score mostly tells us that the classifier is good at detecting the patterns learned by the decoder, not necessarily that it is better on real data.
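
One possible way to check how much of this score carries over to real data is to cross-validate on real rows only, adding the synthetic rows to the training part of each fold and never to the validation part. The sketch below shows the idea under several assumptions: `real_only_cv_accuracy` and `nearest_centroid` are hypothetical helpers, the nearest-centroid rule is just a stand-in for the XGBoost classifier, and the data are random toy blobs rather than the credit dataset.

```python
import numpy as np

def real_only_cv_accuracy(X_real, y_real, X_syn, y_syn, fit_predict, n_splits=5, seed=0):
    """Accuracy per fold, validated on real rows only.

    Synthetic rows are appended to the training part of every fold.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X_real)), n_splits)
    scores = []
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        X_tr = np.concatenate([X_real[train_idx], X_syn], axis=0)
        y_tr = np.concatenate([y_real[train_idx], y_syn], axis=0)
        y_hat = fit_predict(X_tr, y_tr, X_real[val_idx])
        scores.append(np.mean(y_hat == y_real[val_idx]))
    return np.array(scores)

def nearest_centroid(X_tr, y_tr, X_val):
    # stand-in classifier: assign each row to the class with the closest mean
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = ((X_val[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

# toy "real" and "synthetic" data: two Gaussian blobs per set (illustrative only)
rng = np.random.default_rng(1)
X_real = np.concatenate([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_real = np.array([0] * 50 + [1] * 50)
X_syn = np.concatenate([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y_syn = np.array([0] * 200 + [1] * 200)

scores = real_only_cv_accuracy(X_real, y_real, X_syn, y_syn, nearest_centroid)
print(scores.mean())
```

In the notebook, `fit_predict` could wrap the XGBoost classifier, with `X_train`, `y_train` as the real rows and the decoder output as the synthetic ones. The resulting score would then be directly comparable with the baseline cross-validation accuracy.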

### Building a GAN

A GAN is a combination of two models: a generator and a discriminator. The first has to produce fake instances, whereas the second has to distinguish between real data and artificial ones. The training process is a zero-sum game, in which each model tries to beat the other. After the training phase we can use the generator to produce new instances that look like the real ones.
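
To make the zero-sum setup concrete, here is a small NumPy sketch of the two binary cross-entropy losses involved, computed on made-up discriminator outputs. The `binary_crossentropy` helper is my own illustration and is not the Keras implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # mean binary cross-entropy, with clipping to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))))

# made-up discriminator outputs: probability that each instance is real
d_on_fake = np.array([0.1, 0.2, 0.3])
d_on_real = np.array([0.9, 0.8, 0.7])

# discriminator step: fake instances are labelled 0, real ones 1
d_loss = binary_crossentropy(np.concatenate([np.zeros(3), np.ones(3)]),
                             np.concatenate([d_on_fake, d_on_real]))

# generator step: the same fake instances, but labelled 1,
# because the generator wants them to pass as real
g_loss = binary_crossentropy(np.ones(3), d_on_fake)

print(d_loss, g_loss)
```

When the discriminator is doing well its loss is low and the generator loss is high, and vice versa, which is the sense in which the two models compete.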

In [18]:
codings_size = 12

generator = keras.models.Sequential([keras.layers.Dense(100, activation = "selu", input_shape = [codings_size]),
                                     keras.layers.Dense(150, activation = "selu"),
                                     keras.layers.Dense(X_train[0].shape[0] + 1, activation = "sigmoid"),
                                     keras.layers.Reshape([X_train[0].shape[0] + 1, 1])])

discriminator = keras.models.Sequential([keras.layers.Flatten(input_shape=[X_train[0].shape[0] + 1, 1]),
                                         keras.layers.Dense(150, activation="selu"),
                                         keras.layers.Dense(100, activation="selu"),
                                         keras.layers.Dense(1, activation="sigmoid")])

gan = keras.models.Sequential([generator, discriminator])

discriminator.compile(loss = 'binary_crossentropy', optimizer = 'nadam')
discriminator.trainable = False
gan.compile(loss = 'binary_crossentropy', optimizer = 'nadam')

### Training

In [19]:
def train(gan, dataset, batch_size, codings_size, n_epochs = 60):
    generator, discriminator = gan.layers
    n_features = X_train[0].shape[0] + 1  # features plus the label column
    for epoch in range(n_epochs):
        print("Epoch {}/{}".format(epoch + 1, n_epochs))
        for X_batch in dataset:
            # phase 1: train the discriminator on a mix of fake and real instances
            noise = tf.random.normal(shape = [batch_size, codings_size])
            generated_instances = generator(noise)
            # drop the trailing axis and match the dtype of the real batch
            fake = tf.cast(tf.reshape(generated_instances, [-1, n_features]), X_batch.dtype)
            X_fake_and_real = tf.concat([fake, X_batch], axis = 0)
            y1 = tf.constant([[0.]] * batch_size + [[1.]] * batch_size)
            discriminator.trainable = True
            discriminator.train_on_batch(X_fake_and_real, y1)
            # phase 2: train the generator through the frozen discriminator
            noise = tf.random.normal(shape = [batch_size, codings_size])
            y2 = tf.constant([[1.]] * batch_size)
            discriminator.trainable = False
            gan.train_on_batch(noise, y2)
            
batch_size = 20
codings_size = 12
dataset = tf.data.Dataset.from_tensor_slices(data).shuffle(200)
dataset = dataset.batch(batch_size, drop_remainder = True).prefetch(1)
train(gan, dataset, batch_size, codings_size)

Epoch 1/60
Epoch 2/60
Epoch 3/60
Epoch 4/60
Epoch 5/60
Epoch 6/60
Epoch 7/60
Epoch 8/60
Epoch 9/60
Epoch 10/60
Epoch 11/60
Epoch 12/60
Epoch 13/60
Epoch 14/60
Epoch 15/60
Epoch 16/60
Epoch 17/60
Epoch 18/60
Epoch 19/60
Epoch 20/60
Epoch 21/60
Epoch 22/60
Epoch 23/60
Epoch 24/60
Epoch 25/60
Epoch 26/60
Epoch 27/60
Epoch 28/60
Epoch 29/60
Epoch 30/60
Epoch 31/60
Epoch 32/60
Epoch 33/60
Epoch 34/60
Epoch 35/60
Epoch 36/60
Epoch 37/60
Epoch 38/60
Epoch 39/60
Epoch 40/60
Epoch 41/60
Epoch 42/60
Epoch 43/60
Epoch 44/60
Epoch 45/60
Epoch 46/60
Epoch 47/60
Epoch 48/60
Epoch 49/60
Epoch 50/60
Epoch 51/60
Epoch 52/60
Epoch 53/60
Epoch 54/60
Epoch 55/60
Epoch 56/60
Epoch 57/60
Epoch 58/60
Epoch 59/60
Epoch 60/60


### Generating data

In [20]:
noise = tf.random.normal(shape = [batch_size * 250, codings_size])
data_gan = generator(noise).numpy().reshape(-1, X_train[0].shape[0] + 1)
augmented_data_gan = np.concatenate([train_set, data_gan], axis = 0)
np.random.shuffle(augmented_data_gan)
aug_X_gan = augmented_data_gan[:, :-1]
aug_y_gan = augmented_data_gan[:, -1]
aug_y_gan = [round(x) for x in aug_y_gan]

In [21]:
scores = cross_val_score(clf, aug_X_gan, aug_y_gan, cv = 5, scoring = 'accuracy')
print('Average accuracy on a 5 folds cross validation: {}'.format(np.round(scores.mean(), 4)))

Average accuracy on a 5 folds cross validation: 0.9141


### Evaluation on the test set

Let us start with the test-set accuracy of an XGBoost classifier trained on the original data.

In [22]:
clf = xgb.XGBClassifier(n_estimators = 200, learning_rate = 0.1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

from sklearn.metrics import accuracy_score

print('Accuracy of the classifier trained on the regular data: ', np.round(accuracy_score(y_pred, y_test), 10))

Accuracy of the classifier trained on the regular data:  0.6475409836


In [23]:
clf = xgb.XGBClassifier(n_estimators = 200, learning_rate = 0.1)
clf.fit(aug_X, aug_y)
y_pred_var_ae = clf.predict(X_test)

from sklearn.metrics import accuracy_score

print('Accuracy of the classifier trained on variational autoencoder data: ', np.round(accuracy_score(y_pred_var_ae, 
                                                                                                      y_test), 10))

Accuracy of the classifier trained on variational autoencoder data:  0.6639344262


In [25]:
clf = xgb.XGBClassifier(n_estimators = 200, learning_rate = 0.1)
clf.fit(aug_X_gan, aug_y_gan)
y_pred_gan = clf.predict(X_test)

from sklearn.metrics import accuracy_score

print('Accuracy of the classifier trained on GAN data: ', np.round(accuracy_score(y_pred_gan, y_test), 4))

Accuracy of the classifier trained on GAN data:  0.6393


### Conclusion 

Even though data augmentation seems to achieve good results in cross-validation, we have to be careful with it: on the test set the variational autoencoder gives only a small gain over the baseline (0.6639 vs 0.6475), while the GAN does slightly worse (0.6393). We have to keep in mind that we are not actually generating new data: we are just increasing the number of observations without adding new information. Still, this process can make a model more stable against outliers or anomalies, so it can be seen as one of the last resorts for dealing with small datasets.