## Things to test out:

* **Custom Model Training:** Since your dataset is highly specific and likely not well represented in pre-trained models, training a custom model from scratch might be the most effective approach. This involves designing a Convolutional Neural Network (CNN) architecture tailored to your dataset's characteristics. You can start with a simple architecture and gradually increase its complexity as needed. Training from scratch lets the model fit your data directly, but it requires a significant amount of computational resources and time.

* **Efficient Model Architecture:** Given the large number of classes, you might need to experiment with different model architectures to find one that balances accuracy and computational efficiency. Consider parameter-efficient architectures such as **DenseNet** or **Xception**. They are not designed specifically for large label sets, but they can keep the cost of the feature extractor manageable as the number of classes grows.

* **Regularization Techniques:** To prevent overfitting, especially with a large number of classes, consider using regularization techniques such as **L1** or **L2** regularization. These techniques add a penalty to the loss function, encouraging the model to **learn simpler, more generalized features**.
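
To make the penalty concrete, here is a minimal pure-Python sketch of how L1 and L2 terms are added to a loss. The weights, the base loss, and the coefficient `lam` are made-up illustrative values, not taken from this project. In Keras the same idea is usually attached per layer through the `kernel_regularizer` argument, for example with `tf.keras.regularizers.l2`.

```python
# Minimal sketch of L1 and L2 penalties (illustrative values only).

def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -1.0, 2.0]  # hypothetical layer weights
base_loss = 1.25            # hypothetical data loss (e.g. cross-entropy)
lam = 0.01                  # hypothetical regularization strength

loss_l1 = base_loss + l1_penalty(weights, lam)
loss_l2 = base_loss + l2_penalty(weights, lam)
print(f"loss with L1: {loss_l1:.4f}")
print(f"loss with L2: {loss_l2:.4f}")
```

Because the L2 term squares each weight, large weights are penalized disproportionately, while the L1 term tends to push small weights toward exactly zero.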



### TLDR:
* **Architecture:**
  * DenseNet
  * Xception

* **Regularization:**
  * L1/L2

## DenseNet

[Keras DenseNet Architectures](https://www.tensorflow.org/api_docs/python/tf/keras/applications/densenet)

[Keras DenseNet Applications](https://keras.io/api/applications/densenet/)

For **DenseNet**, considering your scenario with 17,000 unique classes, here are some considerations:

**Complexity and Accuracy:** DenseNet addresses the vanishing gradient problem by connecting each layer to all preceding layers, which allows for efficient use of parameters and helps in capturing features at various levels of abstraction. This architecture can potentially offer high accuracy for complex datasets, as it allows for the sharing of features across layers, which can be beneficial for tasks with a large number of classes 3.

**Training Time:** DenseNet's architecture, while efficient in terms of parameter usage, can be computationally intensive due to its dense connectivity pattern. This means that DenseNet models, especially those with a high number of layers (e.g., DenseNet201), can take a significant amount of time to train. If training time is a concern, you might need to consider models with fewer layers, such as DenseNet121 or DenseNet169 3.

**Model Size:** In terms of parameter count, DenseNet models are relatively compact: the Keras Applications table lists roughly 8 million parameters for DenseNet121 and roughly 20 million for DenseNet201. The heavier cost is memory during training, because every layer in a dense block receives the concatenated feature maps of all preceding layers, and those activations have to be kept around. Storage on a server is therefore less likely to be the bottleneck than GPU or system memory while training.

**Residual Connections:** DenseNet does not use residual connections in the sense of ResNet, where a layer's output is added to its input. Instead, it concatenates the outputs of all preceding layers and feeds them to each new layer, which is a different approach to the vanishing gradient problem. This dense connectivity encourages feature reuse and efficient use of parameters, but the concatenated feature maps grow with depth, so memory use and time per training step can be higher than with additive residual connections 3.
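
The difference between adding and concatenating can be sketched by tracking channel counts only. This is a simplified illustration, not the Keras implementation; the function names are hypothetical. The example numbers (64 input channels, 6 layers, growth rate 32) correspond to the first dense block of DenseNet121.

```python
# Sketch: channel bookkeeping for additive (ResNet-style) versus
# concatenative (DenseNet-style) connections. No tensors are involved.

def residual_channels(in_channels, num_layers):
    """x + f(x) keeps the channel count unchanged, however many layers."""
    return in_channels


def dense_block_channels(in_channels, num_layers, growth_rate):
    """Each dense layer appends `growth_rate` new feature maps to its input."""
    widths = [in_channels]
    for _ in range(num_layers):
        widths.append(widths[-1] + growth_rate)  # concat([x, f(x)])
    return widths


print(residual_channels(64, 6))
print(dense_block_channels(64, 6, 32))
```

The growing list of widths is why dense blocks are followed by transition layers that compress the channel count before the next block.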

**Depthwise Separable Convolutions:** DenseNet does not use depthwise separable convolutions. Its dense blocks are built from regular convolutions: a 1x1 bottleneck convolution followed by a 3x3 convolution, each preceded by batch normalization and ReLU. DenseNet's parameter efficiency comes from feature reuse through concatenation and a small growth rate per layer. Depthwise separable convolutions are the defining building block of Xception, not of DenseNet.

**tldr:** DenseNet offers a distinctive approach to handling the vanishing gradient problem and can be highly accurate for complex datasets. However, its dense connectivity pattern can lead to longer training times and higher memory use during training, which should be considered when choosing a model for your scenario. Its efficiency comes from feature reuse across layers, not from depthwise separable convolutions.

## Xception

[Keras Xception Architectures](https://www.tensorflow.org/api_docs/python/tf/keras/applications/xception)

[Keras Xception Applications](https://keras.io/api/applications/xception/)

For the **Xception** model, considering your scenario with 17,000 unique classes, here are some considerations:

**Complexity and Accuracy:** In the original Xception paper, Xception slightly outperforms VGG-16, ResNet-152, and Inception-v3 on ImageNet, and outperforms Inception-v3 by a larger margin on the JFT dataset, which comprises over 350 million high-resolution images annotated with labels from a set of 17,000 classes 14. This suggests that Xception can handle complex datasets well, potentially offering high accuracy for your scenario. However, it's important to note that the performance gains are attributed to a more efficient use of model parameters rather than increased capacity 24.

**Training Time:** While specific training time comparisons between Xception and other models like DenseNet are not provided in the sources, it's noted that Xception's training on the JFT dataset took over one month 4. This indicates that Xception might require a significant amount of time to train, especially on large datasets. If training time is a concern, you might need to consider this aspect in your decision-making process.

**Model Size:** Xception has a similar model size to Inception-v3, with approximately 22,855,952 parameters 4. This suggests that Xception's model size is comparable to other state-of-the-art models, which could be a consideration if you're deploying the model on a server with limited storage. However, it's important to note that the performance gains of Xception over Inception-v3 do not come from added capacity but rather from a more efficient use of the model parameters 24.

**Residual Connections:** Xception incorporates residual connections, which have been shown to significantly improve accuracy 1. This feature might be beneficial for your scenario, especially if you're dealing with complex datasets that require deep learning models to capture intricate patterns.

**Depthwise Separable Convolution:** Xception uses modified depthwise separable convolutions, which are claimed to be more efficient than traditional convolutions 1. This could be an advantage in terms of computational efficiency, potentially allowing for faster training times or the ability to train on hardware with limited computational resources.
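
The parameter saving of a depthwise separable convolution can be worked out with simple arithmetic. The sketch below compares it with a regular convolution for one hypothetical layer (3x3 kernel, 128 input channels, 256 output channels); biases are ignored, and the layer sizes are chosen only for illustration.

```python
# Sketch: parameter counts for a regular versus a depthwise separable
# convolution (biases ignored).

def standard_conv_params(kernel, c_in, c_out):
    """One kernel x kernel filter per (input channel, output channel) pair."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Depthwise step (one spatial filter per input channel) plus
    pointwise step (1x1 convolution mixing channels)."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


regular = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(regular, separable, round(regular / separable, 2))
```

For this layer the separable version needs 33,920 weights instead of 294,912, roughly 8.7 times fewer.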

**tldr:** Xception offers a balance between accuracy and efficiency, with its architecture designed to handle complex datasets effectively. However, its training time and model size are comparable to other models, so these factors should be considered alongside your specific requirements and constraints.

## Compared:

This [paper](https://www.mdpi.com/2079-9292/12/14/3132) describes how several models, including Xception and DenseNet, performed in comparison to each other.

Takeaways:
- MobileNet and ResNet struggled with the more difficult dataset
- On the easy dataset at the recommended image size, MobileNet and ResNet achieved remarkable accuracy. In these ideal scenarios, DenseNet followed by Xception also performed very well, in the 98% range
- On the harder dataset, DenseNet followed by Xception outperformed MobileNet and ResNet
- On smaller image sizes (75x75, which is smaller than the models' recommended input size), DenseNet and Xception also outperformed the rest

## Setup dataset:

Import required packages

In [1]:
# Import packages
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint, EarlyStopping

import datetime
from pathlib import Path
import os

2024-04-02 09:37:21.431384: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-02 09:37:22.650550: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-02 09:37:24.893911: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Data generators allow us to apply data augmentation to artificially create more training data

In [2]:
# Define paths
train_dir = 'dataset/train'
validation_dir = 'dataset/valid'
test_dir = 'dataset/test'

# Define parameters
IMG_SIZE = (224, 224)
BATCH_SIZE = 8

# Define logs directory
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

# Define checkpoints directory
checkpoints_dir = "checkpoints"

In [3]:
# Data generators
train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=20,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   fill_mode='nearest')

validation_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(train_dir,
                                                    target_size=IMG_SIZE,
                                                    batch_size=BATCH_SIZE,
                                                    class_mode='categorical')

validation_generator = validation_datagen.flow_from_directory(validation_dir,
                                                              target_size=IMG_SIZE,
                                                              batch_size=BATCH_SIZE,
                                                              class_mode='categorical')

test_generator = test_datagen.flow_from_directory(test_dir,
                                                  target_size=IMG_SIZE,
                                                  batch_size=BATCH_SIZE,
                                                  class_mode='categorical',
                                                  shuffle=False)

Found 84635 images belonging to 525 classes.
Found 2625 images belonging to 525 classes.
Found 2625 images belonging to 525 classes.


In [4]:
# Load the DenseNet model
base_model = DenseNet121(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the base model
base_model.trainable = False

# Add custom layers
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(train_generator.num_classes, activation='softmax')(x)

# Final model
model = Model(inputs=base_model.input, outputs=predictions)

2024-04-02 09:37:53.533259: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-02 09:37:53.761037: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
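
Since the base model is frozen, only the custom head is trained, and its size depends directly on the number of classes. The sketch below estimates the head's parameter count for the 525 classes found above and for the 17,000-class scenario discussed earlier. It assumes the 1024-channel feature vector that DenseNet121 produces after global average pooling; the helper names are hypothetical.

```python
# Sketch: trainable parameters in the Dense(1024) -> Dense(num_classes) head.

FEATURES = 1024  # DenseNet121 feature channels after global average pooling
HIDDEN = 1024    # units in the hidden Dense layer of the head


def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


def head_params(num_classes):
    """Hidden layer plus softmax output layer."""
    return dense_params(FEATURES, HIDDEN) + dense_params(HIDDEN, num_classes)


for n in (525, 17_000):
    print(f"{n} classes: {head_params(n):,} trainable head parameters")
```

With 17,000 classes the output layer alone accounts for about 17.4 million weights, so the head, not the backbone, can dominate the model size.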


In [9]:
# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

# Define callbacks
tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)

checkpoint_callback = ModelCheckpoint(filepath=f"{checkpoints_dir}/model-{{epoch:03d}}-{{val_loss:.3f}}.keras",
                                      monitor='val_loss',
                                      verbose=1,
                                      save_best_only=True,
                                      mode='auto')

early_stopping_callback = EarlyStopping(monitor="val_loss",         # watch the validation loss
                                        patience=5,                 # stop if it has not improved for 5 consecutive epochs
                                        restore_best_weights=True)  # restore the weights from the epoch with the lowest val_loss
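
The two callbacks can be illustrated without TensorFlow. The sketch below is a simplified model of their behavior, not the Keras implementation: `checkpoint_name` shows how the `{epoch:03d}` and `{val_loss:.3f}` placeholders in a checkpoint path get filled in, and `stopping_epoch` mimics the patience logic of early stopping. Both function names, the template path, and the loss values are hypothetical.

```python
# Sketch: checkpoint file naming and early-stopping patience logic.

def checkpoint_name(template, epoch, val_loss):
    """Fill the placeholders of a checkpoint path template."""
    return template.format(epoch=epoch, val_loss=val_loss)


def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0           # improvement resets the counter
        else:
            wait += 1          # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None


template = "checkpoints/model-{epoch:03d}-{val_loss:.3f}.keras"
print(checkpoint_name(template, 3, 0.51234))

losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.66, 0.68, 0.69, 0.70]
print(stopping_epoch(losses, patience=5))      # best epoch is 3, stop at 8
print(stopping_epoch(losses[:4], patience=5))  # too few epochs to trigger
```

The last call shows that a run shorter than the patience window can never be stopped early, which is worth keeping in mind when choosing the number of epochs.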

In [None]:
"""
# Train the model with TensorBoard callback
history = model.fit(train_generator,
                    steps_per_epoch=train_generator.samples // BATCH_SIZE,
                    epochs=4,
                    validation_data=validation_generator,
                    validation_steps=validation_generator.samples // BATCH_SIZE,
                    callbacks=[tensorboard_callback, checkpoint_callback, early_stopping_callback])
"""

The cell above trains the model, but on this setup it emits warnings such as:
" W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 102760448 exceeds 10% of free system memory.",
indicating that the training process needs more system memory than is comfortably available. This should go away when the notebook is run on a machine with more system memory.
For now it is worked around by running the script from a terminal, limiting system memory usage.

While training, this message can appear at the end of each epoch:
"Local rendezvous is aborting with status: OUT_OF_RANGE".
This message is expected behavior and apparently does not affect the performance of the trained model.

We can read more about how data in Tensorflow works here:
[TensorFlow input pipelines](https://www.tensorflow.org/guide/data)

For now, let's assume the training process will not be run in a notebook but in a terminal instead. Hence, for the remainder of the notebook we will use the trained model.

In [None]:
# Evaluate the model
test_loss, test_acc = model.evaluate(test_generator, steps=test_generator.samples // BATCH_SIZE)
print(f'Test accuracy: {test_acc}, Test loss: {test_loss}')
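
One detail of the `samples // BATCH_SIZE` step counts is that floor division leaves out the final partial batch. The sketch below shows the effect for the dataset sizes reported above with a batch size of 8; `step_counts` is a hypothetical helper. If every image should be covered, rounding up with `math.ceil` is one option, and omitting `steps` should let Keras iterate over the whole generator.

```python
# Sketch: how many batches and images floor division covers.
import math


def step_counts(num_samples, batch_size):
    """Return (floor steps, ceil steps, images skipped by floor division)."""
    floor_steps = num_samples // batch_size
    ceil_steps = math.ceil(num_samples / batch_size)
    skipped = num_samples - floor_steps * batch_size
    return floor_steps, ceil_steps, skipped


print(step_counts(84635, 8))  # training set
print(step_counts(2625, 8))   # validation and test sets
```

With these numbers, 3 training images per epoch and 1 test image fall outside the counted steps.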

In [None]:
# Save the model in the native Keras format (.keras)
model.save('trained_models/model_1.keras')

Run Tensorboard:

In [10]:
%load_ext tensorboard
%tensorboard --logdir logs