

This notebook presents models trained on a subset of the iNaturalist dataset. It contains the following sections:

*   A description of the selected conventional ML model and deep learning model;
*   Some notes about the choices made in building the conventional ML model and deep learning model;
*   A discussion of the performance of the two models;
*   A reflection section stating major takeaways from this exercise.


# Conventional ML Model

The conventional model that produced the best-performing predictions for the Kaggle submission (45.5% accuracy on the coarse-grained labels and 5% on the fine-grained labels) was a bagging classifier over decision trees, with 400 estimators and a maximum of 400 samples per estimator. The image data was standardized with StandardScaler beforehand.

In [None]:
# Taking the data in using the provided code stub
X,Y = create_dataset_sklearn('train', fine_grained=fine_grained, percent=0.1)

The code below standardizes the features:

In [None]:
# Necessary import
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then reuse the same statistics
# for the validation data (X_val is assumed to be loaded like X, from 'val')
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_val = scaler.transform(X_val)

The code below defines and fits the final model:

In [None]:
# Necessary imports
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Bagging ensemble of 400 decision trees, each fit on a bootstrap sample of 400 images
clf_Bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=400,
    max_samples=400,
    bootstrap=True,
    n_jobs=-1,
    random_state=42,
)
clf_Bagging.fit(X, Y)

# Predict on the validation set
y_pred_Bagging = clf_Bagging.predict(X_val)

# Notes on the Conventional ML Model

For the final model, the number of estimators and the sample count were chosen by trial and error. Going above the chosen values gave no significant gain in accuracy, and going below them reduced it. (The accuracy figures below are for the coarse-grained set.)
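
The trial-and-error search described above could be made repeatable with a small sweep over candidate values. The sketch below is illustrative only: it uses synthetic data from make_classification as a stand-in for the image features, and the candidate values (10, 50, 100) and max_samples=200 are assumptions chosen to keep it fast, not the settings used for the submission.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); substitute the real X, Y, X_val, Y_val
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=4, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Fit one bagging ensemble per candidate size and record validation accuracy
val_scores = {}
for n_estimators in (10, 50, 100):
    clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=n_estimators,
        max_samples=200,
        bootstrap=True,
        random_state=42,
    )
    clf.fit(X_tr, y_tr)
    val_scores[n_estimators] = clf.score(X_va, y_va)

for n_estimators, acc in val_scores.items():
    print(f"n_estimators={n_estimators:4d}  validation accuracy={acc:.3f}")
```

The same loop can be extended to sweep max_samples, which makes it easier to see where accuracy stops improving.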

In addition to this model, quite a few other models were tried. Logistic regression with 500 iterations and naive Bayes both reached an accuracy of around 23-25%. A random forest with 500 estimators and a maximum of 3 leaf nodes, and an AdaBoost classifier applied to another random forest instance, both reached 42%. K-nearest neighbors was also tried, with a disappointing 23%. The bagging classifier was therefore chosen as the final model.

# Deep Learning Model

The deep learning model that produced the best-performing predictions for the Kaggle submission (accuracy of roughly 67±5% on the coarse-grained labels and 25% on the fine-grained labels) used ResNet50 as the base, followed by a fully connected dense layer with 1024 neurons, a dropout layer, and a softmax output layer.

In [1]:
# Loading data using the provided code stub
train_ds = create_dataset_tf('train', fine_grained=fine_grained, batch_size=batch_size)
val_ds = create_dataset_tf('val', fine_grained=fine_grained, batch_size=batch_size)

In [None]:
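import tensorflow as tf

# Load ResNet50 pretrained on ImageNet, without its top (fully connected) layers.
# The `classes` argument only applies when include_top=True, so it is omitted here.
base_model = tf.keras.applications.ResNet50(weights='imagenet', include_top=False, input_shape=(112, 112, 3))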
data_augmentation = tf.keras.Sequential(
    [tf.keras.layers.RandomFlip(mode="horizontal", seed=42),
     tf.keras.layers.RandomRotation(factor=0.05, seed=42)]
)
# Freeze the layers in the base model
for layer in base_model.layers:
    layer.trainable = False

model = tf.keras.Sequential([
    data_augmentation,
    base_model,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=1024, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(units=8, activation="softmax") # Change to units=50 for fine grained
])
model.build(input_shape=(None, 112, 112, 3))

# Print the summary of the model
model.summary()

In [None]:
from functools import partial
# Optimizer used for the initial training, with the base-model layers frozen
CustomAdam = partial(tf.keras.optimizers.Adam, learning_rate=0.0001)
model.compile(optimizer=CustomAdam(), loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Saving initial weights
checkpoint_path = "/gdrive/MyDrive/comp8220data/fullinitialize"

# Create a callback that saves the model's weights
cp_callback_initial = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)
# Initial training
history = model.fit(train_ds,validation_data=val_ds, epochs = 2, callbacks=[cp_callback_initial],verbose=1)

In [None]:
# Unfreeze the base-model layers from index 50 onward; the first 50 stay frozen
for layer in base_model.layers[50:]:
    layer.trainable = True

# Recompile with a lower learning rate for fine-tuning
CustomAdam = partial(tf.keras.optimizers.Adam, learning_rate=0.00003)
model.compile(optimizer=CustomAdam(), loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Saving weights for later use
checkpoint_path = "/gdrive/MyDrive/comp8220data/fulltest1"

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)
# Training the model
history = model.fit(train_ds,validation_data=val_ds, epochs = 20, callbacks=[cp_callback], verbose=1)

# Notes on the Deep Learning Model

For the final model, hyperparameters were chosen by trial and error. Any higher learning rate caused the validation loss and accuracy to zigzag during training. Any lower learning rate made the model take too long (more than 40 epochs) to converge.

In addition to the final model, I also tried a CNN with six convolutional layers (two each with 100, 200 and 400 filters), four max-pooling layers, one hidden dense layer and one output layer. This model performed very similarly to the final model (68% on the coarse-grained set) and provided no significant performance benefit. On the final model, the Adadelta optimizer with a learning rate of 1.0 was tried as well. It gave a smoother curve on the validation set but no performance benefit. All these models plateaued at around 67% (±5%) accuracy, which suggests that a major change, possibly to the training data, is needed to get a breakthrough.

# Discussion of Model Performance and Implementation

Comparing my final conventional ML and deep learning models, the deep learning one performed better by about 23 percentage points on the public test set. The deep learning model ranked #35 out of N submissions on the public test set, with the top-performing system achieving 90%+ accuracy on the coarse-grained set.

Performance on the public and private test sets is in line with performance on the validation set, which suggests there isn't much overfitting. However, the low overall accuracy indicates there may be underfitting in some areas.

Unfortunately, I overlooked the class imbalance in the data, which I should have identified early on; this oversight is the likely reason the models underperform. A potential remedy is to adjust the class weights so that minority classes count for more and majority classes for less.
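
The class-weight idea can be sketched as follows. This is a minimal illustration, not something that was run for the submission: the helper balanced_class_weights and the toy label list are hypothetical, and the weighting formula, n_samples / (n_classes * class_count), is the one scikit-learn documents for class_weight="balanced".

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (hypothetical helper)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy labels standing in for the real training labels (an assumption):
# class 0 is the majority class, class 2 the minority class
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
class_weight = balanced_class_weights(toy_labels)

for cls, weight in class_weight.items():
    print(f"class {cls}: weight {weight:.3f}")

# With Keras, the resulting dict could be passed to fit, for example:
# model.fit(train_ds, validation_data=val_ds, epochs=20, class_weight=class_weight)
```

For the scikit-learn side, DecisionTreeClassifier accepts class_weight="balanced" directly, so the bagging model could apply the same weighting without a separate helper.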

# Reflections
The dataset definitely needs to be explored before trying to fit models. When there is a major issue in the dataset, such as class imbalance, it seems that no amount of model tweaking can resolve it. The tweaks I tried included changing the batch size and using the SGD, Adadelta and Adam optimizers, each with different learning rates. Tweaking model complexity involved increasing or decreasing the number of neurons and adding more layers to the model. None of these pushed the model's accuracy beyond 68±2%.

One issue I noticed while training the deep learning models is the resource limitation. The free version of Google Colab is not ideal because of the arbitrary, undisclosed GPU usage limit it imposes. I had to rotate between two Google accounts from time to time to continue working. Another issue is that the runtime disconnects after about an hour of user inactivity, which is a problem because models can take hours to train and the runtime is sometimes disconnected mid-training. This can be worked around by saving weights, as done in the deep learning code above.
