# Exploring Deep Neural Networks for CIFAR-10 Image Classification

## Introduction

Deep Neural Networks (DNNs) have emerged as powerful tools for various machine learning tasks, particularly in the realm of image classification. In this Jupyter Notebook project, we embark on a journey into deep learning using TensorFlow to tackle one of the most iconic image classification benchmarks: the CIFAR-10 dataset.

CIFAR-10 is a widely utilized benchmark dataset comprising 50,000 32x32 color images across 10 classes, with each class containing 5,000 images. The objective is to classify each image into one of the following categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. While Convolutional Neural Networks (CNNs) are typically the architecture of choice for image classification tasks due to their spatial hierarchical learning capabilities, in this notebook, we will take a different approach. Instead of relying on CNNs, we will investigate the effectiveness of Deep Neural Networks (DNNs) in classifying CIFAR-10 images.

The primary goal of this notebook is to showcase the potential of DNNs in handling image classification tasks even without convolutional layers. By constructing a DNN architecture from scratch using TensorFlow, we aim to gain insights into how DNNs perform on CIFAR-10 and explore their strengths and limitations compared to CNNs.

In [5]:
import tensorflow as tf

tf.random.set_seed(42)

(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Establishing a Validation Set to Assess DNN Performance

In order to gauge the performance of our Deep Neural Network (DNN), it's crucial to establish a validation set.
I believe, that selecting 5000 samples from our dataset should suffice for this purpose.

In [6]:
X_train = X_train_full[5000:]
y_train = y_train_full[5000:]
X_valid = X_train_full[:5000]
y_valid = y_train_full[:5000]

# Creating a Sequential Model with TensorFlow API

## DNN Architecture:

- **Number of layers:** 5
- **Number of neurons per layer:** 100
- **Model type:** Sequential (no skip connections)

## Hyperparameters:

- **Activation Function:**
  - Activation function: SELU (Scaled Exponential Linear Unit)
    - Proven to self-normalize under certain conditions

- **Weight Initialization:**
  - Weight initialization: LeCun initialization
    - Ensures effective learning dynamics

- **Normalization Techniques:**
  - Utilizing SELU activation eliminates the need for normalization techniques such as BatchNormalization or Dropout

- **Input Standardization:**
  - All inputs should be standardized before passing them to the DNN
    - Ensures consistency and improves convergence

- **Regularization Technique:**
  - EarlyStopping regularization employed
    - Utilizes a validation set to monitor loss convergence
    - Allows for flexible control over the number of epochs

- **Optimizer:**
  - Optimizer: Nadam (Nesterov Adam)
    - Known for its effectiveness in optimizing deep neural networks
    - Combines momentum with the adaptive learning rate properties of Adam

## Conditions for SELU to self-normalize:
  - LeCun initialization
  - Sequential architecture of model (no skip connections)
  - Dense or convolutional layers
  - Standardized inputs


In [51]:
def create_model(n_layers=5, n_neurons=100, shape=[32, 32, 3]):
    model = tf.keras.models.Sequential()

    model.add(tf.keras.layers.Flatten(input_shape=shape))
    for _ in range(n_layers):
        model.add(tf.keras.layers.Dense(n_neurons,
                                 activation="selu",
                                 kernel_initializer="lecun_normal"))
    model.add(tf.keras.layers.Dense(10, activation="softmax"))

    optimizer = tf.keras.optimizers.Nadam()
    
    model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
    return model

# Parameter Tuning and TensorBoard Visualization

Before training our model, one parameter we can tweak is the learning_rate of Nadam's optimizer. Adjusting this parameter can potentially lead to faster convergence and help avoid local optima. To aid us in this process, TensorFlow provides a powerful tool called TensorBoard for data visualization.

To begin, we need to set up a data path for the TensorBoard callback to store the results:

In [22]:
import os

root_logdir = os.path.join(os.curdir, "my_logs")

def get_run_logdir(id_for_run):
 return os.path.join(root_logdir, id_for_run)

To optimize our model's performance, we can experiment with different values of the learning_rate parameter. After training with each value, we can visualize the results using TensorBoard, a powerful tool provided by TensorFlow for data visualization.

In [53]:
norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(X_train)

X_train_scaled = norm_layer(X_train)
X_valid_scaled = norm_layer(X_valid)

model = create_model()

for i in [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]:
    optimizer = tf.keras.optimizers.Nadam(learning_rate=i)
    model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
    tensorboard_cb = tf.keras.callbacks.TensorBoard(get_run_logdir(str(i)))
    history = model.fit(X_train_scaled, y_train, epochs=10,
                        validation_data=(X_valid_scaled, y_valid),
                        callbacks=[tensorboard_cb])

Epoch 1/10
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 4ms/step - accuracy: 0.2419 - loss: 2.2027 - val_accuracy: 0.3624 - val_loss: 1.7958
Epoch 2/10
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 4ms/step - accuracy: 0.3718 - loss: 1.7927 - val_accuracy: 0.4048 - val_loss: 1.6957
Epoch 3/10
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 4ms/step - accuracy: 0.4061 - loss: 1.6928 - val_accuracy: 0.4194 - val_loss: 1.6450
Epoch 4/10
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 4ms/step - accuracy: 0.4302 - loss: 1.6318 - val_accuracy: 0.4284 - val_loss: 1.6127
Epoch 5/10
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 4ms/step - accuracy: 0.4456 - loss: 1.5868 - val_accuracy: 0.4362 - val_loss: 1.5868
Epoch 6/10
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 4ms/step - accuracy: 0.4578 - loss: 1.5504 - val_accuracy: 0.4458 - val_loss: 1.5674
Epoch 7/10
[1m1

## Evaluating Learning Rate Performance:

Default learning rate: 1e-4, resulted in the best performance based on the plotted metrics in TensorBoard.

## Exploring Learning Rate Scheduling:

Learning Rate Scheduling offers a dynamic approach to adjusting the learning rate during training. This technique allows the model to adapt and fine-tune its learning rate over time, potentially leading to improved convergence and performance on the dataset. There are many scheduling algorithms, like:
 - Power scheduling
 - Exponential scheduling
 - Piecewise constant scheduling
 - Performance scheduling

# Implementing Performance Scheduling Algorithm with ReduceLROnPlateau Callback

For this task, the Performance Scheduling algorithm is the chosen method for dynamically adjusting the learning rate during training. This algorithm evaluates the validation error every N steps, reducing the learning rate if the error fails to decrease for N consecutive steps.

## Utilizing ReduceLROnPlateau Callback:

Implementing Performance Scheduling in Keras is straightforward with the ReduceLROnPlateau callback. While there are two variants of this callback—one reduces the learning rate every N epochs, and the other reduces it every N steps—we will adopt the former approach for this task.

## Defining full set of callbacks

- Early Stopping callback to prevent overfitting
- TensorBoard callback to visualize data
- ModelCheckpoint callback to save the model when its performance on the validation set is the best so far (save_best_only=True must be set)
- ReduceLROnPlateau callback to reach a good solution faster than with the optimal constant learning rate


In [30]:
root_log_dir = os.path.join(os.curdir, "my_logs_train")

def get_run_logdir():
    import time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_log_dir, run_id)

log_dir = get_run_logdir()
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

In [31]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=6, monitor="val_loss", mode='min')
lr_scheduler_cb = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3)
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model_v1.keras", save_best_only=True)
callbacks = [early_stopping_cb, checkpoint_cb, tensorboard_cb, lr_scheduler_cb]

In [88]:
model = create_model()
optimizer = tf.keras.optimizers.Nadam(learning_rate=1e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

model.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)

Epoch 1/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 4ms/step - accuracy: 0.3332 - loss: 1.9311 - val_accuracy: 0.4328 - val_loss: 1.6049 - learning_rate: 1.0000e-04
Epoch 2/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 4ms/step - accuracy: 0.4513 - loss: 1.5532 - val_accuracy: 0.4598 - val_loss: 1.5319 - learning_rate: 1.0000e-04
Epoch 3/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 4ms/step - accuracy: 0.4952 - loss: 1.4375 - val_accuracy: 0.4734 - val_loss: 1.4960 - learning_rate: 1.0000e-04
Epoch 4/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 4ms/step - accuracy: 0.5212 - loss: 1.3562 - val_accuracy: 0.4856 - val_loss: 1.4711 - learning_rate: 1.0000e-04
Epoch 5/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 4ms/step - accuracy: 0.5445 - loss: 1.2929 - val_accuracy: 0.4868 - val_loss: 1.4573 - learning_rate: 1.0000e-04
Epoch 6/100
[1m1407/1407[0m [32m

<keras.src.callbacks.history.History at 0x2a3527c4310>

In [89]:
model = tf.keras.models.load_model("my_cifar10_model_v1.keras")
model.evaluate(X_valid_scaled, y_valid)

[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.5115 - loss: 1.4395


[1.440427541732788, 0.5091999769210815]

In [90]:
model.evaluate(X_train_scaled, y_train)

[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - accuracy: 0.6457 - loss: 1.0136


[1.0154922008514404, 0.6470444202423096]

It took us only 11 epochs to reach this result, after the 11th epoch model's there was no significant increase in validation accuracy, whereas training accuracy kept increasing, meaning that the model started to overfit the data significantly. We can try building new neural network, but now using ELU activation function together with BatchNormalization and whether it works better or not.

In [39]:
model_bn = tf.keras.models.Sequential()

model_bn.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
model_bn.add(tf.keras.layers.BatchNormalization())

for _ in range(5):
    model_bn.add(tf.keras.layers.Dense(100,
                                 kernel_initializer="he_normal",
                                 kernel_constraint=tf.keras.constraints.max_norm(1.)))
    model_bn.add(tf.keras.layers.BatchNormalization())
    model_bn.add(tf.keras.layers.Activation("elu"))
model_bn.add(tf.keras.layers.Dense(10, activation="softmax"))

optimizer = tf.keras.optimizers.Nadam(learning_rate=1e-3)
model_bn.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

We can set learning rate to 0.001, since we are using Performance scheduling.

In [40]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=6, monitor="val_loss")
lr_scheduler_cb = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3)
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model_bn_v1.keras", save_best_only=True)
callbacks = [early_stopping_cb, checkpoint_cb, tensorboard_cb, lr_scheduler_cb]

In [41]:
model_bn.fit(X_train_scaled, y_train, epochs=100,
          validation_data=(X_valid_scaled, y_valid),
          callbacks=callbacks)


Epoch 1/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m12s[0m 6ms/step - accuracy: 0.3550 - loss: 1.8220 - val_accuracy: 0.4190 - val_loss: 1.6278 - learning_rate: 0.0010
Epoch 2/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 5ms/step - accuracy: 0.4234 - loss: 1.6046 - val_accuracy: 0.4418 - val_loss: 1.5864 - learning_rate: 0.0010
Epoch 3/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 5ms/step - accuracy: 0.4411 - loss: 1.5666 - val_accuracy: 0.4348 - val_loss: 1.5744 - learning_rate: 0.0010
Epoch 4/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 7ms/step - accuracy: 0.4482 - loss: 1.5412 - val_accuracy: 0.4432 - val_loss: 1.5606 - learning_rate: 0.0010
Epoch 5/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m22s[0m 15ms/step - accuracy: 0.4577 - loss: 1.5186 - val_accuracy: 0.4568 - val_loss: 1.5380 - learning_rate: 0.0010
Epoch 6/100
[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x1f34edf2c10>

In [42]:
model_bn = tf.keras.models.load_model("my_cifar10_model_bn_v1.keras")
model_bn.evaluate(X_valid_scaled, y_valid)

[1m157/157[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.5033 - loss: 1.4019


[1.4019014835357666, 0.5059999823570251]

In [43]:
model_bn.evaluate(X_train_scaled, y_train)

[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - accuracy: 0.5819 - loss: 1.1800


[1.1776326894760132, 0.5823333263397217]

## Conclusion

After training and evaluating the second model, we can conclude the following:

- **Validation Accuracy**: There was no significant difference in validation accuracy between the two models. Both models performed similarly in terms of predicting on unseen data.

- **Gap Between Validation and Training Accuracies**: The second model exhibited a much smaller gap between validation and training accuracies compared to the first model. This suggests that the second model may have better generalization performance and could benefit from further tuning.

## Next Steps

We can further explore and fine-tune the parameters of the second model to enhance its performance. Potential steps include:
- Hyperparameter tuning: Experiment with different learning rates, batch sizes, and optimizer configurations.
- Regularization techniques: Apply dropout, L1/L2 regularization, or other regularization methods to prevent overfitting (other than Max_norm regularization).
- Model architecture modifications: Adjust the number of layers, units per layer, or try alternative activation functions to improve performance.

Overall, the findings suggest that the second model shows promise and warrants further investigation to unlock its full potential.
 

Fine-tuning becomes an easy job, when it comes to Keras Tuning library. We need to build a class that will have two methods: build() and fit(), build should return a model with new hyperparameters entered and fit is used to fit the model, but we can utilize it to decide how to preprocess the data or tweak batch size, and so on, based on hyperparameters passed to it. 

In [54]:
X_train = X_train_full[5000:]
y_train = y_train_full[5000:]
X_valid = X_train_full[:5000]
y_valid = y_train_full[:5000]

norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(X_train)

X_train_scaled = norm_layer(X_train)
X_valid_scaled = norm_layer(X_valid)

In [55]:
import keras_tuner as kt

def build_model(hp):
        n_hidden = hp.Int("n_hidden", default=5, min_value=1, max_value=10)
        n_neurons = hp.Int("n_neurons", default=100, min_value=64, max_value=256)
        learning_rate = hp.Float("learning_rate", min_value=1e-5, max_value=1e-3)
        
        model_bn = tf.keras.models.Sequential()

        model_bn.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
        model_bn.add(tf.keras.layers.BatchNormalization())

        for _ in range(n_hidden):
            model_bn.add(tf.keras.layers.Dense(n_neurons,
                                 kernel_initializer="he_normal",
                                 kernel_constraint=tf.keras.constraints.max_norm(1.)))
            model_bn.add(tf.keras.layers.BatchNormalization())
            model_bn.add(tf.keras.layers.Activation("elu"))
            
        model_bn.add(tf.keras.layers.Dense(10, activation="softmax"))

        optimizer = tf.keras.optimizers.Nadam(learning_rate=learning_rate)
        model_bn.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
        return model_bn

In [58]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=6, monitor="val_loss")
lr_scheduler_cb = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3)
callbacks = [early_stopping_cb, lr_scheduler_cb]

bayesian_tuner = kt.BayesianOptimization(
    build_model, objective="val_accuracy", max_trials=10,
    alpha=1e-4, beta=2.6, seed=42, directory="randomtuning",
    project_name="cifar10_usingDNN", overwrite=True
)
bayesian_tuner.search(X_train_scaled, y_train, epochs=10,
                      validation_data=(X_valid_scaled, y_valid),
                      callbacks=callbacks)    

Trial 10 Complete [00h 01m 48s]
val_accuracy: 0.5180000066757202

Best val_accuracy So Far: 0.5198000073432922
Total elapsed time: 00h 13m 48s


In [64]:
best_model = bayesian_tuner.get_best_models(num_models=1)[0]
best_model.evaluate(X_train_scaled, y_train)

[1m1407/1407[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 2ms/step - accuracy: 0.6308 - loss: 1.0289


[1.0325251817703247, 0.6324666738510132]

## Exploring Basic Fine-Tuning

In this section, we delve into the results of basic fine-tuning, focusing on fundamental DNN architectures without extensive optimization. The objective here is to gain insights into working with simple DNNs and understanding popular real-world architectures.

### Initial Results

The initial fine-tuning results are summarized below:

- *Accuracy*: The accuracy achieved with basic fine-tuning serves as a baseline for further exploration.
- *Optimizer Selection*: No optimization regarding the choice of optimizer, max norm regularization value, batch size, etc., has been performed yet. This allows us to grasp the essence of DNN architectures without delving into intricate optimization strategies.

### Conclusion

Through basic fine-tuning, we have gained valuable insights into working with simple DNN architectures. Moving forward, we will explore optimization strategies and delve deeper into enhancing model performance while considering the need for lightning-fast response in certain scenarios.


In [65]:
X_test_scaled = norm_layer(X_test)
best_model.evaluate(X_test_scaled, y_test)

[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.5150 - loss: 1.4133


[1.4172695875167847, 0.5113000273704529]

### Different thoughts on what can be explored further (Considerations for Lightning-Fast Response)

While my primary focus has been on  finding the best architecture for this problem and basic fine-tuning, I acknowledge the importance of lightning-fast response in certain applications. Here are some considerations for achieving rapid predictions:

- *Activation Functions*: ReLU or Leaky ReLU can be preferred over SeLU for faster response times, especially in self-normalizing models.
- *Sparse Models*: Implementing sparse models, possibly using l1 regularization, can aid in achieving rapid predictions. However, this requires adjustments such as replacing SeLU with ReLU, normalization through BatchNormalization, among others.