# Hybridization of Traditional and GAN-based Augmentations
- In this kernel, we combine the **Traditional Augmentation** and **GAN-Based Augmentation** approaches.
- The traditional and GAN-based augmentation kernels each tried 3 approaches, which gives 9 possible combinations. Here we use only one: the approach with the best test-set accuracy from each kernel.
- From the traditional augmentation kernel, we select 'Augmentation for class balancing' (**77.46% accuracy**), and from the GAN-based augmentation kernel, 'Augmentation based on class-wise performance' (**76.64% accuracy**).
- We apply each approach to the training dataset individually, then merge both augmented datasets with the training set.
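
The merge described in the last bullet can be sketched with NumPy on toy arrays. The helper name `merge_with_augmented` and the toy shapes are assumptions made for illustration; the kernel itself performs this step in Section 3.3.

```python
import numpy as np

def merge_with_augmented(x_train, y_train, augmented_sets):
    """Append every (images, labels) pair in augmented_sets to the training set.

    Hypothetical helper for illustration, not part of the original kernel.
    """
    xs = [x_train] + [x for x, _ in augmented_sets]
    ys = [y_train] + [y for _, y in augmented_sets]
    return np.concatenate(xs, axis=0), np.concatenate(ys, axis=0)

# Toy stand-ins: 6 original, 2 GAN-generated and 3 traditionally augmented images
x_train, y_train = np.zeros((6, 32, 32, 3)), np.zeros((6, 1), dtype=int)
x_gan, y_gan = np.ones((2, 32, 32, 3)), np.ones((2, 1), dtype=int)
x_trad, y_trad = np.full((3, 32, 32, 3), 0.5), np.full((3, 1), 2)

x_all, y_all = merge_with_augmented(
    x_train, y_train, [(x_gan, y_gan), (x_trad, y_trad)]
)
print(x_all.shape, y_all.shape)
```

The original examples stay in the merged set, so the augmented images only add to the training data and never replace it.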

### Reference Kernels
- [Traditional Augmentation](https://www.kaggle.com/code/elemento/rw-tradaug)
- [GANs Augmentation 1](https://www.kaggle.com/code/elemento/rw-ganaug-1) and [GANs Augmentation 2](https://www.kaggle.com/code/elemento/rw-ganaug-2)

# 1. Importing the Packages & Boilerplate Code

In [1]:
import os
import sys
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm import tqdm
from shutil import copyfile
from tabulate import tabulate
from sklearn.metrics import accuracy_score, log_loss, confusion_matrix

# Suppressing TensorFlow's C++ log messages; must be set before importing TensorFlow
# https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/274717
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
import tensorflow.keras.layers as tfl

In [2]:
# Setting the seeds
SEED = 0
os.environ['PYTHONHASHSEED']=str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
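
As a quick check of what seeding buys, the sketch below re-seeds Python's `random` module and NumPy and confirms that the same draws come back. The helper `set_basic_seeds` is hypothetical and deliberately leaves out `tf.random.set_seed`, so it runs without TensorFlow.

```python
import random
import numpy as np

def set_basic_seeds(seed):
    """Seed Python's and NumPy's global generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)

def draw():
    """Return one draw from each global generator."""
    return random.random(), np.random.rand(3).tolist()

set_basic_seeds(0)
first = draw()
set_basic_seeds(0)
second = draw()
print(first == second)  # True: identical seeds give identical draws
```

Note that `PYTHONHASHSEED` only takes effect when it is set before the interpreter starts, so assigning it inside a running kernel does not change hashing for that session. Some GPU operations may also remain non-deterministic despite the seeds.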

In [3]:
# Making sure that TensorFlow is able to detect the GPU
device_name = tf.test.gpu_device_name()
if "GPU" not in device_name:
    print("GPU device not found")
else:
    print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


# 2. Importing the Train and Test Sets

In [4]:
# Importing the Labelled Training Dataset
print("For Train Dataset:")
df_train = pd.read_csv("../input/cifar10/train_lab_x.csv")
y_train = pd.read_csv("../input/cifar10/train_lab_y.csv")
df_train = np.array(df_train)
y_train = np.array(y_train)
print(df_train.shape, y_train.shape)

# Reshaping to channels-last and rescaling to [0, 1] (labels are one-hot encoded after augmentation)
df_train = np.reshape(df_train, (-1, 3, 32, 32))
df_train = np.transpose(np.array(df_train), (0, 2, 3, 1))
df_train = df_train / 255
print(df_train.shape)

# Importing the Test Dataset
print("For Test Dataset:")
df_test = pd.read_csv("../input/cifar10/test_x.csv")
y_test = pd.read_csv("../input/cifar10/test_y.csv")
df_test = np.array(df_test)
y_test = np.array(y_test)
print(df_test.shape, y_test.shape)

# Reshaping the dataset
df_test = np.reshape(df_test, (-1, 3, 32, 32))
print(df_test.shape)

# Transposing to channels-last, rescaling to [0, 1] and one-hot encoding the labels
df_test = np.transpose(np.array(df_test), (0, 2, 3, 1))
df_test = df_test / 255
y_test_oh = tf.one_hot(np.ravel(y_test), depth = 10)
print(df_test.shape, y_test_oh.shape)

For Train Dataset:
(40006, 3072) (40006, 1)
(40006, 32, 32, 3)
For Test Dataset:
(10000, 3072) (10000, 1)
(10000, 3, 32, 32)
(10000, 32, 32, 3) (10000, 10)
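
The reshape and transpose above assume that each 3072-value row is stored channel by channel (1024 values per channel, row-major within a channel). The sketch below traces one pixel through both steps on a toy row, and shows a NumPy equivalent of the one-hot encoding; the toy data is an assumption made for illustration.

```python
import numpy as np

# One flat row whose value at position i is simply i, so pixels can be traced
flat = np.arange(3 * 32 * 32).reshape(1, -1)      # shape (1, 3072)

chw = flat.reshape(-1, 3, 32, 32)                 # (N, channels, height, width)
hwc = np.transpose(chw, (0, 2, 3, 1))             # (N, height, width, channels)

# Pixel at row h, column w, channel c sits at flat index c*1024 + h*32 + w
h, w, c = 5, 7, 2
print(hwc.shape, hwc[0, h, w, c], c * 1024 + h * 32 + w)

# NumPy counterpart of tf.one_hot for integer labels of shape (N, 1)
labels = np.array([[3], [0]])
one_hot = np.eye(10)[labels.ravel()]
print(one_hot.shape)
```

Reshaping straight to `(-1, 32, 32, 3)` would run without error but scramble the pixels, which is why the intermediate `(-1, 3, 32, 32)` shape matters.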


# 3. Performing the Augmentations on the Training Set
## 3.1. GAN-Based Augmentation

In [5]:
df_gan_aug = pd.read_csv("../input/cifar10/df_clsper_aug.csv")
y_gan_aug = pd.read_csv("../input/cifar10/y_clsper_aug.csv")
df_gan_aug = np.array(df_gan_aug)
y_gan_aug = np.array(y_gan_aug)

# Reshaping to channels-last; unlike the train/test sets, no rescaling is applied here
df_gan_aug = np.reshape(df_gan_aug, (-1, 3, 32, 32))
df_gan_aug = np.transpose(np.array(df_gan_aug), (0, 2, 3, 1))
print(df_gan_aug.shape, y_gan_aug.shape)

(10048, 32, 32, 3) (10048, 1)
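
The train and test arrays are divided by 255, but the GAN-augmented array above is not. Whether that matters depends on how `df_clsper_aug.csv` was saved, which this kernel does not show. A minimal check, assuming non-negative pixel data that is either in [0, 1] or in [0, 255], could look like this; `looks_unscaled` is a hypothetical helper and the threshold of 1.0 is an assumption.

```python
import numpy as np

def looks_unscaled(images, upper=1.0):
    """Return True if any pixel exceeds `upper`, suggesting a 0-255 range."""
    return bool(np.max(images) > upper)

rng = np.random.default_rng(0)
scaled = rng.random((4, 32, 32, 3))              # toy data with values in [0, 1)
raw = rng.integers(0, 256, (4, 32, 32, 3))       # toy data with values in 0..255

for name, arr in [("scaled", scaled), ("raw", raw)]:
    if looks_unscaled(arr):
        arr = arr / 255                          # bring into [0, 1] like the train set
    print(name, float(arr.min()) >= 0.0, float(arr.max()) <= 1.0)
```

Running such a check on `df_gan_aug` before concatenation would confirm that all three sources share the same value range.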


## 3.2. Traditional Augmentation

In [6]:
y_train_reshape = np.reshape(y_train, (-1))
num_examples = np.zeros((10,))

for i in y_train_reshape:
    num_examples[i] += 1

# Number of examples from each class
num_exa = num_examples.astype('int32')

# Finding out the maximum number of examples for any class
max_exa = max(num_exa)

# Number of examples that need to be added to each class to match the largest class
aug_exa = [max_exa - num_exa[i] for i in range(10)]

# Creating a list of lists for storing the indices of data-points in the training dataset, class-wise
classes_ind = []
for i in range(10):
    classes_ind.append([])

for ind, clss in enumerate(y_train_reshape):
    classes_ind[clss].append(ind)

# # Transforming list of lists into numpy array
# classes_ind = np.array([np.array(xi) for xi in classes_ind])

print(num_exa)
print(aug_exa, sum(aug_exa))
print(len(classes_ind), len(classes_ind[0]))

# Creating a list for indices of images and their labels on which augmentation needs to be done
# These are randomly chosen from each class
aug_ind = []
y_trad_aug = []

for i in range(10):
    indices = random.choices(classes_ind[i], k = aug_exa[i])
    aug_ind.extend(indices)
    y_trad_aug.extend([i]*aug_exa[i])

print(len(aug_ind), len(y_trad_aug))

[4109 3839 4022 4116 4312 3952 4290 3552 3436 4378]
[269, 539, 356, 262, 66, 426, 88, 826, 942, 0] 3774
10 4109
3774 3774
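
The per-class deficits printed above can be reproduced directly from the class counts. The sketch below copies the counts from the output and uses a vectorized subtraction in place of the list comprehension; it is an equivalent formulation for illustration, not the kernel's code.

```python
import numpy as np

# Class counts copied from the output above
num_exa = np.array([4109, 3839, 4022, 4116, 4312, 3952, 4290, 3552, 3436, 4378])

# np.bincount gives the same counts as the explicit counting loop
labels = np.repeat(np.arange(10), num_exa)
counts = np.bincount(labels, minlength=10)

# Deficit of each class relative to the largest class
aug_exa = counts.max() - counts
print(aug_exa.tolist(), int(aug_exa.sum()))

# After augmentation every class would hold the same number of examples
balanced = counts + aug_exa
print(np.unique(balanced).tolist())
```

Because `random.choices` samples with replacement, the same source image can be picked more than once within a class; each pick still receives its own random flip and rotation.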


In [7]:
data_augmentation = tf.keras.Sequential([
    tfl.RandomFlip("horizontal"),
    tfl.RandomRotation(0.1),
])

# Creating an empty list
df_trad_aug = []

# Iterating over the images selected for augmentation
for ind in tqdm(aug_ind):
    aug_image = data_augmentation(df_train[ind, : , : , : ])
    df_trad_aug.append(aug_image)

# Sanity Checks and Transformations
df_trad_aug = np.array(df_trad_aug)
y_trad_aug = np.reshape(np.array(y_trad_aug), (-1, 1))
print(df_trad_aug.shape, y_trad_aug.shape)

100%|██████████| 3774/3774 [00:28<00:00, 130.97it/s]


(3774, 32, 32, 3) (3774, 1)
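
To make the two layers concrete: `RandomFlip("horizontal")` mirrors an image left to right at random, and `RandomRotation(0.1)` draws an angle from ±0.1 of a full turn, i.e. ±36°. The NumPy sketch below illustrates only the flip and the angle range; it is not how Keras implements these layers, and `random_hflip` is a hypothetical helper.

```python
import numpy as np

def random_hflip(image, rng, p=0.5):
    """Mirror an (H, W, C) image left to right with probability p."""
    if rng.random() < p:
        return image[:, ::-1, :]
    return image

rng = np.random.default_rng(0)
image = np.arange(2 * 3 * 1).reshape(2, 3, 1)    # tiny 2x3 single-channel image

flipped = image[:, ::-1, :]                      # deterministic flip for comparison
maybe_flipped = random_hflip(image, rng)         # either `image` or `flipped`
print(image[..., 0].tolist(), flipped[..., 0].tolist())

# RandomRotation(factor) samples from [-factor, factor] as a fraction of 2*pi
factor = 0.1
max_angle_deg = factor * 360
print(max_angle_deg)
```

Both transforms preserve the label for CIFAR-10 classes, which is why the augmented images can reuse the class index of their source image.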


## 3.3. Preparing the Augmented Training Set

In [8]:
# Concatenating the training set with the GAN-based and traditional augmented sets
df_aug = np.concatenate([df_train, df_gan_aug, df_trad_aug], axis=0)
y_aug = np.concatenate([y_train, y_gan_aug, y_trad_aug], axis=0)

# Creating a random permutation & shuffling the dataset
perm = np.random.permutation(df_aug.shape[0])
df_aug = np.array(df_aug[perm, : , : , : ])
y_aug = y_aug[perm]

# One-Hot Encoding
y_aug_oh = tf.one_hot(np.ravel(y_aug), depth = 10)
print(df_aug.shape, y_aug.shape, y_aug_oh.shape)

(53828, 32, 32, 3) (53828, 1) (53828, 10)
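
As a sanity check, the merged size should equal the sum of the three sources, and the shuffle should keep every label attached to its image. The sketch below verifies both, using the sizes printed earlier and toy arrays in which each "image" equals its label; the toy arrays are assumptions made for illustration.

```python
import numpy as np

# Sizes reported earlier in the kernel
n_train, n_gan, n_trad = 40006, 10048, 3774
total = n_train + n_gan + n_trad
print(total)   # 53828, matching the shape printed above

# A single permutation applied to both arrays keeps each label with its image
x = np.arange(8).reshape(8, 1).astype(float)     # toy "images": value == index
y = np.arange(8).reshape(8, 1)                   # toy labels: label == index

perm = np.random.default_rng(0).permutation(len(x))
x_shuf, y_shuf = x[perm], y[perm]
print(bool((x_shuf == y_shuf).all()))
```

Shuffling the two arrays with separate permutations would break this pairing silently, since the shapes would still match.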


# 4. Training the Model
## 4.1. Preparing the Baseline Model and the Augmented Training Set

In [9]:
# Importing the Baseline Model Architecture
copyfile(src = "../input/dcai-rw/baseline_arch.py", dst = "../working/baseline_arch.py")
from baseline_arch import cnn_model

# Creating Batches from the Augmented Dataset
train_dataset = tf.data.Dataset.from_tensor_slices((df_aug, y_aug_oh)).batch(32)
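
With 53,828 examples and a batch size of 32, the number of steps per epoch and the size of the final partial batch follow from simple arithmetic. The sketch assumes the default behaviour of `Dataset.batch`, which keeps the last incomplete batch rather than dropping it.

```python
import math

n_examples, batch_size = 53828, 32

# Number of batches per epoch when the remainder is kept
steps_per_epoch = math.ceil(n_examples / batch_size)

# Size of the final, partial batch
last_batch = n_examples - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch)   # 1683 4
```

Since the pipeline has no `.shuffle(...)` call, every epoch visits the batches in the same order, namely the one fixed by the permutation in Section 3.3.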

In [10]:
num_epochs = [10, 20, 30, 40, 50]
train_loss, test_loss, train_acc, test_acc = [], [], [], []

for epochs in num_epochs:
    # Training the Model
    conv_model = cnn_model((32, 32, 3))
    conv_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    conv_model.fit(train_dataset, epochs = epochs)
    
    # Predicting on the Train/Test Datasets
    preds_train = conv_model.predict(df_aug)
    preds_test = conv_model.predict(df_test)

    # Finding the Predicted Classes
    cls_train = np.argmax(preds_train, axis = 1)
    cls_test = np.argmax(preds_test, axis = 1)
    
    # Finding the Train/Test set Loss
    train_loss.append(log_loss(y_aug_oh, preds_train))
    test_loss.append(log_loss(y_test_oh, preds_test))
    train_acc.append(accuracy_score(y_aug, cls_train))
    test_acc.append(accuracy_score(y_test, cls_test))
    
    print("For ", epochs, " Epochs:")
    print("Log-loss for Train Dataset = ", train_loss[-1])
    print("Log-loss for Test Dataset = ", test_loss[-1])
    print("Accuracy for Train Dataset = ", train_acc[-1])
    print("Accuracy for Test Dataset = ", test_acc[-1])
    print()

Epoch 1/10 ... Epoch 10/10
For  10  Epochs:
Log-loss for Train Dataset =  0.4559950406188158
Log-loss for Test Dataset =  0.7846033715799916
Accuracy for Train Dataset =  0.8395073196106115
Accuracy for Test Dataset =  0.7329

Epoch 1/20 ... Epoch 20/20
For  20  Epochs:
Log-loss for Train Dataset =  0.25285143427824347
Log-loss for Test Dataset =  0.7247940823813778
Accuracy for Train Dataset =  0.917645091773798
Accuracy for Test Dataset =  0.757

Epoch 1/30 ... Epoch 22/30 (output truncated here)
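
For reference, the two metrics collected in the sweep reduce to a few lines of NumPy. This is a sketch of the standard definitions (mean negative log-probability of the true class, and the fraction of correct argmax predictions), not scikit-learn's implementation; the function names and the toy probabilities are made up for illustration.

```python
import numpy as np

def categorical_log_loss(y_onehot, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.sum(y_onehot * np.log(probs), axis=1)))

def accuracy(y_true, probs):
    """Fraction of rows whose argmax matches the integer label."""
    return float(np.mean(np.argmax(probs, axis=1) == np.ravel(y_true)))

# Toy example with 3 samples and 3 classes (made-up probabilities)
y_true = np.array([[0], [1], [2]])
y_onehot = np.eye(3)[y_true.ravel()]
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3]])

loss = categorical_log_loss(y_onehot, probs)
acc = accuracy(y_true, probs)
print(loss, acc)
```

Log-loss penalizes confident wrong predictions heavily, so it can rise even while accuracy improves, as happens between the 20-epoch sweep result and the final model in this kernel.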

In [11]:
# Retraining the model with the epoch count that gave the highest test accuracy in the sweep
ind = np.argmax(test_acc)
best_num_epochs = num_epochs[ind]
conv_model = cnn_model((32, 32, 3))
conv_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
conv_model.fit(train_dataset, epochs = best_num_epochs)

# Saving the model along with its weights
conv_model.save('hybrid_trad_gan_augmented.h5')

Epoch 1/40 ... Epoch 40/40


## 4.2. Evaluating the Performance

In [12]:
# Predicting on the Train/Test Datasets
preds_train = conv_model.predict(df_aug)
preds_test = conv_model.predict(df_test)

# Finding the Predicted Classes
cls_train = np.argmax(preds_train, axis = 1)
cls_test = np.argmax(preds_test, axis = 1)

# Finding the Train/Test set Loss
print("Log-loss for Augmented Dataset = ", log_loss(y_aug_oh, preds_train))
print("Log-loss for Test Dataset = ", log_loss(y_test_oh, preds_test))
print("Accuracy for Augmented Dataset = ", accuracy_score(y_aug, cls_train))
print("Accuracy for Test Dataset = ", accuracy_score(y_test, cls_test))

Log-loss for Augmented Dataset =  0.14989328262409068
Log-loss for Test Dataset =  0.8253970891010501
Accuracy for Augmented Dataset =  0.9504904510663595
Accuracy for Test Dataset =  0.7595
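
Putting the final number next to the two reference results quoted in the introduction gives a quick comparison. The values below are copied from this kernel (77.46% and 76.64% from the introduction, 0.7595 and 0.9505 from the output above); the dictionary layout is just for illustration.

```python
# Test accuracies in percent, as reported in this kernel
test_acc = {
    "traditional (class balancing)": 77.46,
    "GAN (class-wise performance)": 76.64,
    "hybrid (this kernel)": 75.95,
}

hybrid = test_acc["hybrid (this kernel)"]
for name, acc in test_acc.items():
    print(f"{name:32s} {acc:6.2f}  diff vs hybrid: {acc - hybrid:+.2f}")

# Gap between accuracy on the augmented training set and on the test set
train_acc_final = 95.05   # 0.9504904... rounded to two decimals
gap = train_acc_final - hybrid
print(f"train/test gap: {gap:.2f} points")
```

In this single run the hybrid lands 1.51 points below the traditional result and 0.69 points below the GAN result, with a train/test gap of about 19 points, so the combination does not improve on either individual approach here.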
