# Test-03
Run on **Kaggle**.

## Description
This test focuses on data augmentation experiments with libraries such as **Albumentations** and **Keras**. We built image generators that apply transformations to simulate a larger dataset, using the 'on the fly' technique because static generation needed more RAM than was available. Applying the transformation per batch did not work correctly, but applying it per training pass did, so the model is retrained repeatedly on the same set of 40,000 images, transformed anew for each pass.
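
To give a rough sense of the memory argument, the sketch below estimates the size of a statically augmented dataset. It assumes 32×32×3 images stored as float32 and 20 augmented copies (the pass count used later in the training loops); the storage dtype and the helper name `dataset_bytes` are assumptions made for illustration, not measurements from the run.

```python
import numpy as np

def dataset_bytes(n_images, shape=(32, 32, 3), dtype=np.float32):
    """Bytes needed to hold n_images arrays of the given shape and dtype."""
    return n_images * int(np.prod(shape)) * np.dtype(dtype).itemsize

one_copy = dataset_bytes(40_000)   # a single copy of the training set
static_copies = 20 * one_copy      # 20 pre-generated augmented copies

print(f"one copy: {one_copy / 1e9:.2f} GB")
print(f"20 static copies: {static_copies / 1e9:.2f} GB")
```

Under these assumptions one copy takes about 0.49 GB and twenty take about 9.83 GB, before counting the original arrays and the model itself.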

The conclusion is that a model improved over the previous tests, and similar to the VGG16 architecture in some respects, reached a metric equal to the best obtained so far, but with less overfitting. It is therefore likely that the technique paid off. The next step is to build more complex networks, which should be easier to train with data augmentation.

# Sources

### Link: https://towardsdatascience.com/simple-image-data-augmentation-technics-to-mitigate-overfitting-in-computer-vision-2a6966f51af4
General explanation of **data augmentation** techniques for image datasets.

### Link: https://albumentations.ai/docs/getting_started/image_augmentation/
Official site of the **Albumentations** library, used to build pipelines of random operations that are applied to images in order to generate a larger dataset.

### Link: https://medium.com/the-artificial-impostor/custom-image-augmentation-with-keras-70595b01aeac
It is worth noting how data augmentation can be applied 'on the fly' to each batch, which keeps the dataset from growing too large to handle in memory.
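
As a rough illustration of the per-batch idea, the sketch below yields randomly flipped mini-batches while the source arrays stay unchanged in memory. It uses only NumPy; the function `augmented_batches`, the random input data, and the flip-only transformation are hypothetical stand-ins, not the article's code.

```python
import numpy as np

def augmented_batches(x, y, batch_size, rng):
    """Yield (images, labels) mini-batches, flipping each image horizontally
    with probability 0.5. The source arrays are never modified."""
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        batch = x[idx]  # fancy indexing returns a copy
        flip = rng.random(len(idx)) < 0.5
        # Axis 2 of an (N, H, W, C) batch is the image width
        batch[flip] = batch[flip][:, :, ::-1, :]
        yield batch, y[idx]

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(10, 32, 32, 3), dtype=np.uint8)
y = np.arange(10)

shapes = [bx.shape for bx, _ in augmented_batches(x, y, batch_size=4, rng=rng)]
print(shapes)
```

Only one mini-batch of transformed images exists at a time, so memory use depends on the batch size rather than on how many augmented variants are drawn.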

In [1]:
import os

In [2]:
import numpy as np

In [3]:
import seaborn as sns

In [4]:
import matplotlib.pyplot as plt

In [6]:
sns.set(style='darkgrid', context='notebook')

## Loading the datasets

In [8]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/rn2021q1-itba-cifar100/y_train.npy
/kaggle/input/rn2021q1-itba-cifar100/x_test.npy
/kaggle/input/rn2021q1-itba-cifar100/x_train.npy


In [9]:
x_train_valid = np.load('/kaggle/input/rn2021q1-itba-cifar100/x_train.npy')
y_train_valid = np.load('/kaggle/input/rn2021q1-itba-cifar100/y_train.npy')
x_test = np.load('/kaggle/input/rn2021q1-itba-cifar100/x_test.npy')

# Splitting into training and validation sets

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
x_train, x_valid, y_train, y_valid = train_test_split(x_train_valid, y_train_valid, test_size=0.2, random_state=15, stratify=y_train_valid)
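
The `stratify` argument keeps class proportions equal in both splits. The sketch below checks this on synthetic labels (100 classes with 50 samples each, a made-up stand-in for the real label array) using the same `test_size` and `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 classes with 50 samples each (hypothetical data)
labels = np.repeat(np.arange(100), 50)
features = np.zeros((len(labels), 1))

_, _, y_tr, y_va = train_test_split(
    features, labels, test_size=0.2, random_state=15, stratify=labels
)

# Count how many samples of each class landed in each split
train_counts = np.bincount(y_tr, minlength=100)
valid_counts = np.bincount(y_va, minlength=100)
print(train_counts.min(), train_counts.max())
print(valid_counts.min(), valid_counts.max())
```

Every class ends up with the same 80/20 share, so validation accuracy is not skewed by under-represented classes.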

# Data normalization

In [12]:
x_valid_norm = x_valid / 255
x_test_norm = x_test / 255
x_train_norm = x_train / 255
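
A small check of what this division does, on a made-up array. It assumes the images are stored as uint8, which the notebook does not show. NumPy promotes the result to float64, so each normalized copy takes eight times the memory of a uint8 original; casting to float32 halves that.

```python
import numpy as np

# Hypothetical stand-in for a batch of 8-bit RGB images
images = np.array([[[[0, 128, 255]]]], dtype=np.uint8)

normalized = images / 255                         # true division promotes to float64
normalized32 = (images / 255).astype(np.float32)  # optional cast to save memory

print(normalized.dtype, normalized.min(), normalized.max())
print(normalized32.dtype, normalized32.nbytes, normalized.nbytes)
```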

# Data Augmentation with Albumentations
**Data augmentation** is applied with the Albumentations library to enlarge the training set. The validation set is left untouched, so it can confirm whether the technique helped without being contaminated by transformed data.

In [13]:
from tensorflow.keras.utils import Sequence

In [14]:
from albumentations import Compose, ToFloat, HorizontalFlip, VerticalFlip, Rotate
from albumentations import RandomBrightnessContrast

In [15]:
class AugmentedSequence(Sequence):
  """ Dataset generator with data augmentation """

  def __init__(self, x, y, batch_size, augmentation, shuffle=True):
    """ Create an instance of the data augmented generator, which is a 
        dataset generator to provide 'on the fly' data augmentation.
        @param x             Array of input images
        @param y             Array of labels, aligned with x
        @param batch_size    Number of samples returned per batch
        @param augmentation  Callable used as augmentation(image=...)['image']
        @param shuffle       Whether to reshuffle the sample order after each epoch
    """
    super().__init__()

    # Save internal parameters of the augmented sequence
    self.x = x
    self.y = y
    self.batch_size = batch_size
    self.augmentation = augmentation
    self.shuffle = shuffle

    # Initialization
    self.on_epoch_end()
  
  def __len__(self):
    """ Compute the length of an epoch measured in batches
    """
    return int(np.floor(len(self.x) / float(self.batch_size)))
  
  def __getitem__(self, index):
    """ Return the item from the sequence at the given index
        @param index
    """
    # Generate indexes of the batch
    indexes = self.indexes[index * self.batch_size : (index + 1) * self.batch_size]

    # Extract the input and output batch from the original dataset
    batch_x = self.x[indexes]
    batch_y = self.y[indexes]
    
    # Return an augmented version of the batch
    return np.array([
      self.augmentation(image=x)['image'] for x in batch_x
    ]), np.array(batch_y)

  def on_epoch_end(self):
    """ Updates indexes after each epoch
    """
    self.indexes = np.arange(len(self.x))
    if self.shuffle:
      np.random.shuffle(self.indexes)


In [16]:
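# Create the AugmentedSequence; the batch size matches the 40,000-image
# training set, so each access returns one augmented copy of the whole set
album_generator = AugmentedSequence(
    x_train,
    y_train,
    batch_size=40000,
    augmentation=Compose([
        Rotate(),
        HorizontalFlip(),
        VerticalFlip(),
        RandomBrightnessContrast(),
        ToFloat()
    ])
)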
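
The number of batches this sequence reports follows from the `__len__` formula in the class. The sketch below reproduces that arithmetic for a 40,000-image training set; `batches_per_epoch` is a hypothetical helper, and the 512 case is included only for comparison.

```python
import numpy as np

def batches_per_epoch(n_samples, batch_size):
    """Same formula as AugmentedSequence.__len__: partial batches are dropped."""
    return int(np.floor(n_samples / float(batch_size)))

n_samples = 40_000
for batch_size in (40_000, 512):
    n_batches = batches_per_epoch(n_samples, batch_size)
    dropped = n_samples - n_batches * batch_size
    print(batch_size, n_batches, dropped)
```

With a batch size of 40,000 the sequence holds a single batch, which is why the training loops index it with `[0]`; each access re-applies the random transformations. With a batch size of 512 there would be 78 batches and 64 images left out of each epoch.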

# Data Augmentation with Keras ImageDataGenerator
**Data augmentation** is applied with the Keras library to enlarge the training set. The validation set is left untouched, so it can confirm whether the technique helped without being contaminated by transformed data.

In [17]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [18]:
# Create the data generator with Keras preprocessing library
datagen = ImageDataGenerator(
    rotation_range=30,
    zoom_range=0.25,
    horizontal_flip=True,
    vertical_flip=True,
    width_shift_range=0.15,
    height_shift_range=0.15
)

# Keras dataset generator
keras_generator = datagen.flow(x_train_norm, y_train, batch_size=40000)

# Models

In [19]:
from keras.layers import Dense, Flatten, Activation, BatchNormalization, Dropout
from keras.layers import Input, Conv2D, MaxPooling2D, InputLayer, AveragePooling2D
from keras.models import Sequential, Model
from keras.callbacks import TensorBoard, ModelCheckpoint
from keras.optimizers import Adam
from keras.regularizers import l2

In [20]:
import keras

## Model #1

In [21]:
from keras.applications.vgg16 import VGG16

In [22]:
# Create an instance of VGG16 for transfer learning and freeze all of its layers
vgg = VGG16(include_top=False, weights='imagenet', input_shape=(32, 32, 3))
for layer in vgg.layers:
    layer.trainable = False

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5


In [23]:
# Create layers
model = Sequential()
model.add(vgg)
model.add(Flatten())
model.add(Dense(units=1024))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(units=100))
model.add(BatchNormalization())
model.add(Activation('softmax'))

# Compile
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=Adam(learning_rate=0.001),
              metrics=['accuracy']
             )

In [24]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
vgg16 (Functional)           (None, 1, 1, 512)         14714688  
_________________________________________________________________
flatten (Flatten)            (None, 512)               0         
_________________________________________________________________
dense (Dense)                (None, 1024)              525312    
_________________________________________________________________
batch_normalization (BatchNo (None, 1024)              4096      
_________________________________________________________________
activation (Activation)      (None, 1024)              0         
_________________________________________________________________
dropout (Dropout)            (None, 1024)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 100)               1
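
The figures in this summary can be cross-checked by hand. The sketch below assumes the standard VGG16 layout of five 2×2 max-pooling stages and the usual Keras parameter counts for `Dense` and `BatchNormalization` layers; the helper names are made up for the example.

```python
def spatial_size_after_pooling(size, n_pools, pool=2):
    """Side length after n_pools non-overlapping pool x pool max-pooling stages."""
    for _ in range(n_pools):
        size //= pool
    return size

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """gamma, beta, moving mean and moving variance for each feature."""
    return 4 * n_features

print(spatial_size_after_pooling(32, n_pools=5))  # side of the VGG16 output map
print(dense_params(512, 1024))                    # first Dense layer
print(batchnorm_params(1024))                     # BatchNormalization after it
```

These values agree with the `(None, 1, 1, 512)` output shape and with the 525312 and 4096 parameter counts listed in the summary. A 32×32 input leaves VGG16 with a single spatial position, so the frozen base passes only 512 features to the classifier head.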

In [25]:
# Create the ModelCheckpoint callback to save the best model during training
mc_callback = ModelCheckpoint('model_1.hdf5',
                              monitor='val_accuracy',
                              save_best_only=True,
                              verbose=0,
                              mode='max'
                             )

# Train the model
epochs = 5
batch_size = 512
augmented_factor = 20
for _ in range(augmented_factor):
    # Draw a freshly augmented copy of the training set
    batch_x, batch_y = album_generator[0]

    # Train the top model on this augmented copy
    model.fit(batch_x,
              batch_y,
              validation_data=(x_valid_norm, y_valid),
              callbacks=[mc_callback],
              batch_size=batch_size,
              epochs=epochs
              )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
[... the same Epoch 1/5 to Epoch 5/5 log repeats for the remaining 19 passes ...]


In [26]:
# Load the model and show the final metrics
model = keras.models.load_model('model_1.hdf5')

# Train and validation metrics
_, train_acc = model.evaluate(x_train_norm, y_train, verbose=0)
_, valid_acc = model.evaluate(x_valid_norm, y_valid, verbose=0)

# Show result
print(f'[Accuracy] Train: {round(train_acc, 3)} Valid: {round(valid_acc, 3)}')

[Accuracy] Train: 0.597 Valid: 0.34


## Model #2

In [27]:
# Create the model
model = Sequential()
model.add(InputLayer(input_shape=(32, 32, 3)))
model.add(Conv2D(16, 3, padding='same'))
model.add(Conv2D(16, 3, padding='same'))
model.add(BatchNormalization())
model.add(Activation('elu'))
model.add(MaxPooling2D())
model.add(Conv2D(32, 3, padding='same'))
model.add(Conv2D(32, 3, padding='same'))
model.add(Conv2D(32, 3, padding='same'))
model.add(BatchNormalization())
model.add(Activation('elu'))
model.add(MaxPooling2D())
model.add(Conv2D(64, 3, padding='same'))
model.add(Conv2D(64, 3, padding='same'))
model.add(Conv2D(64, 3, padding='same'))
model.add(BatchNormalization())
model.add(Activation('elu'))
model.add(MaxPooling2D())
model.add(Flatten())
model.add(Dense(256))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Activation('elu'))
model.add(Dense(256))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Activation('elu'))
model.add(Dense(100))
model.add(BatchNormalization())
model.add(Activation('softmax'))

# Compile
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=Adam(learning_rate=0.001),
              metrics=['accuracy']
             )

In [28]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 32, 32, 16)        448       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 32, 32, 16)        2320      
_________________________________________________________________
batch_normalization_2 (Batch (None, 32, 32, 16)        64        
_________________________________________________________________
activation_2 (Activation)    (None, 32, 32, 16)        0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 16, 16, 16)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 16, 16, 32)        4640      
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 16, 16, 32)       
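
The convolutional parameter counts listed in this summary follow from the kernel size and channel counts. The sketch below recomputes the first few, assuming the usual count of one weight per kernel position and input channel plus one bias per filter; `conv2d_params` is a hypothetical helper.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights (kernel x kernel x in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

layers = [
    ("conv2d", 3, 16),     # RGB input -> 16 filters
    ("conv2d_1", 16, 16),
    ("conv2d_2", 16, 32),  # first convolution after the first pooling stage
]
counts = {name: conv2d_params(3, c_in, c_out) for name, c_in, c_out in layers}
print(counts)
```

The results, 448, 2320 and 4640, match the corresponding rows of the summary.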

In [29]:
# Create the ModelCheckpoint callback to save the best model during training
mc_callback = ModelCheckpoint('model_2.hdf5',
                              monitor='val_accuracy',
                              save_best_only=True,
                              verbose=0,
                              mode='max'
                             )

# Train the model
epochs = 5
batch_size = 512
augmented_factor = 20
for _ in range(augmented_factor):
    # Draw a freshly augmented copy of the training set
    batch_x, batch_y = album_generator[0]

    # Train the model on this augmented copy
    model.fit(batch_x,
              batch_y,
              validation_data=(x_valid_norm, y_valid),
              callbacks=[mc_callback],
              batch_size=batch_size,
              epochs=epochs
              )

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
[... the same Epoch 1/5 to Epoch 5/5 log repeats for the remaining 19 passes ...]


In [30]:
# Load the model and show the final metrics
model = keras.models.load_model('model_2.hdf5')

# Train and validation metrics
_, train_acc = model.evaluate(x_train_norm, y_train, verbose=0)
_, valid_acc = model.evaluate(x_valid_norm, y_valid, verbose=0)

# Show result
print(f'[Accuracy] Train: {round(train_acc, 3)} Valid: {round(valid_acc, 3)}')

[Accuracy] Train: 0.496 Valid: 0.449
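
One simple way to quantify the overfitting mentioned in the description is the gap between training and validation accuracy. The sketch below computes it from the two results printed in this notebook; using this gap as the measure is an assumption of the example, not a metric defined by the notebook.

```python
# Accuracies copied from the two evaluation cells above
results = {
    "Model #1": {"train": 0.597, "valid": 0.340},
    "Model #2": {"train": 0.496, "valid": 0.449},
}

# Gap between training and validation accuracy for each model
gaps = {name: round(acc["train"] - acc["valid"], 3) for name, acc in results.items()}
for name, gap in gaps.items():
    print(f"{name}: train - valid = {gap}")
```

Model #1 shows a gap of 0.257, while Model #2 shows 0.047 together with a higher validation accuracy.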
