### BTTAI x NYBG Spring 2024 AI Studio Internal Kaggle Competition
#### *Team Sweet Pea_BOS (Muya Guoji & Katie Boscombe)*
This project aims to advance biodiversity research by building a machine learning model that classifies plant specimen images for the New York Botanical Garden (NYBG). Details about the original competition are available at https://www.kaggle.com/competitions/bttai-nybg-2024. Our final model achieved 98.44% accuracy on the competition test data.

#### 1. Import Libraries and Modules
We start by importing necessary Python libraries and modules that will be used in this project. 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import tensorflow as tf
import keras
import tensorflow_hub as hub
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
from keras.preprocessing import image
from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from keras.applications.resnet import ResNet101, preprocess_input as preprocess_input_101, decode_predictions as decode_predictions_101
from keras.applications.resnet_v2 import ResNet152V2, preprocess_input as preprocess_input_152, decode_predictions as decode_predictions_152
from keras.models import Sequential, Model
from keras.layers import Dense, Flatten, GlobalAveragePooling2D
from keras.applications import ResNet50
from keras.utils import to_categorical
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau, TensorBoard

#### 2. Data Preparation

The training, validation, and test metadata are loaded from CSV files, and the image directory for each set follows the layout of the original Kaggle competition dataset. The class IDs are later cast to strings, which `flow_from_dataframe` expects when `class_mode='categorical'` is used.

In [2]:
train_df = pd.read_csv("/kaggle/input/bttai-nybg-2024/BTTAIxNYBG-train.csv")
train_dir = "/kaggle/input/bttai-nybg-2024/BTTAIxNYBG-train/BTTAIxNYBG-train"
validation_df = pd.read_csv("/kaggle/input/bttai-nybg-2024/BTTAIxNYBG-validation.csv")
validation_dir = "/kaggle/input/bttai-nybg-2024/BTTAIxNYBG-validation/BTTAIxNYBG-validation"
test_df = pd.read_csv("/kaggle/input/bttai-nybg-2024/BTTAIxNYBG-test.csv")
test_dir = "/kaggle/input/bttai-nybg-2024/BTTAIxNYBG-test/BTTAIxNYBG-test"

#### 3. Image Data Preprocessing and Augmentation

Data augmentation artificially expands the diversity of a dataset by applying random, yet realistic, transformations to the training images. It helps prevent overfitting and improves generalization to new, unseen data. We applied the following transformations to the training dataset:

*Rescaling*: Each pixel value is multiplied by 1/255, which maps the range 0-255 to 0-1. Strictly speaking this is a preprocessing step rather than an augmentation, but it makes learning from the data easier and more stable for the network.

*Horizontal Flip*: Images are randomly flipped horizontally. This augmentation assumes that a horizontal flip does not change the semantic meaning of the image, which is generally a safe assumption in natural scenes and objects like plants.

*Rotation*: Images are randomly rotated by an angle drawn from -20 to +20 degrees (`rotation_range=20`). This mimics the variation in orientation that can occur with camera angles or object placement, enhancing the robustness of the classifier.

*Shear Transformation*: A shear transformation slants the shape of an image, simulating a change in viewing angle. Here, `shear_range=0.1` sets the maximum shear angle to 0.1 degrees, so the resulting distortion of the image geometry is very slight.

Each of these transformations introduces variability into the dataset while retaining the essential features needed for accurate classification, which equips the model to handle real-world variability in its inputs.
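To make the transformations above concrete, here is a minimal NumPy sketch of rescaling, a random horizontal flip, and the sampling of a rotation angle. It is an illustration only, not the `ImageDataGenerator` implementation: the helper names and the tiny random image are hypothetical, and the symmetric angle range reflects our understanding of how `rotation_range` behaves.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable


def rescale(image_uint8):
    """Map 8-bit pixel values from [0, 255] to floats in [0, 1]."""
    return image_uint8.astype(np.float32) / 255.0


def random_horizontal_flip(image, rng):
    """Mirror the image left-to-right with probability 0.5."""
    if rng.random() < 0.5:
        return image[:, ::-1, :]
    return image


def sample_rotation_angle(rotation_range, rng):
    """Draw an angle in degrees, uniformly from [-rotation_range, +rotation_range]."""
    return rng.uniform(-rotation_range, rotation_range)


# Hypothetical 2x3 RGB image standing in for a specimen photo
image = rng.integers(0, 256, size=(2, 3, 3), dtype=np.uint8)

scaled = rescale(image)
mirrored = scaled[:, ::-1, :]  # what a flip does when it is applied
maybe_flipped = random_horizontal_flip(scaled, rng)
angles = [sample_rotation_angle(20, rng) for _ in range(1000)]
```

A flip only reverses the column order, so every pixel value is preserved; this is why it is considered a label-preserving transformation for plant images.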

In [3]:
train_df['classID'] = train_df['classID'].astype(str)
validation_df['classID'] = validation_df['classID'].astype(str)
items = os.listdir(train_dir)
num_items = len(items)

def pre_processing():
    datagen = ImageDataGenerator(
        rescale=1./255,
        horizontal_flip=True,
        rotation_range=20,
        shear_range=0.1,  # Shear intensity (shear angle in degrees)
    )
    return datagen

datagen = pre_processing()


In [4]:
train_generator = datagen.flow_from_dataframe(
    dataframe = train_df,
    directory = train_dir,
    x_col = "imageFile",
    y_col = "classID",
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
)

Found 81946 validated image filenames belonging to 10 classes.


For the validation data, the only transformation applied is rescaling by a factor of 1/255, which maps pixel values from the 0-255 range to 0-1, just as for the training data. No random augmentation is applied, so performance is assessed on minimally altered images, which keeps the evaluation reliable.

In [5]:
# For the validation data, only apply rescaling
validation_datagen = ImageDataGenerator(rescale=1./255)

validation_generator = validation_datagen.flow_from_dataframe(
    dataframe=validation_df,
    directory=validation_dir,
    x_col="imageFile",
    y_col="classID",
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical'
)


Found 10244 validated image filenames belonging to 10 classes.


#### 4. Model Building
A ResNet101 base model loaded with pretrained ImageNet weights is used to classify the images. We implemented additional layers to allow the model to transition from understanding generic image features (learned from ImageNet) to interpreting and classifying images based on the specific characteristics of the NYBG image dataset. They adapt the model from broad image recognition tasks to a focused classification problem, enabling it to perform with higher accuracy and relevance to the target task. 

1. GlobalAveragePooling2D:
This layer reduces the spatial dimensions of the input feature maps. Instead of flattening the feature maps into one very long vector, which would require a far larger dense layer, global average pooling computes the average value of each feature map. This discards the spatial layout but keeps one summary value per feature, so the layers that follow need far fewer parameters. The result is lower model complexity and a reduced risk of overfitting.

2. Dense Layer (1024 units, ReLU activation):
After the feature reduction through pooling, the model needs to learn which features are most relevant for distinguishing between the classes specific to the NYBG dataset. A dense layer with 1024 neurons and ReLU activation serves this purpose. It gives the model a substantial number of trainable parameters, allowing it to learn complex patterns from the pooled features.

3. Output Dense Layer (10 units, Softmax activation):
The final layer in the architecture is a dense layer with a number of units equal to the number of classes to be predicted (in this case, 10). This layer uses the softmax activation function to output a probability distribution across the 10 classes, where each value represents the likelihood of the input image belonging to one of the classes. This layer essentially maps the learned features to specific class predictions.
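The effect of global average pooling on the size of the classification head can be checked with a little arithmetic. The sketch below is a NumPy illustration rather than the Keras layers themselves. It assumes the 7x7x2048 feature map that ResNet101 produces for a 224x224 input when `include_top=False`; the random feature maps and the `dense_params` helper are hypothetical.

```python
import numpy as np

# Assumed feature-map shape for a 224x224 input with include_top=False
height, width, channels = 7, 7, 2048

rng = np.random.default_rng(0)
feature_maps = rng.random((1, height, width, channels))  # hypothetical batch of one

# Global average pooling: one mean per channel, taken over the two spatial axes
pooled = feature_maps.mean(axis=(1, 2))


def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# Head used in this notebook: pooling -> Dense(1024) -> Dense(10)
head_with_pooling = dense_params(channels, 1024) + dense_params(1024, 10)

# Alternative for comparison: Flatten -> Dense(1024) -> Dense(10)
head_with_flatten = dense_params(height * width * channels, 1024) + dense_params(1024, 10)

print(pooled.shape)        # (1, 2048)
print(head_with_pooling)   # 2108426
print(head_with_flatten)   # 102771722
```

Under these assumptions the pooled head has roughly 2.1 million trainable parameters, while a flattened head would need about 102.8 million, almost all of them in the first dense layer.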

In [6]:
#resnet101 model
base_model = ResNet101(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Add additional layers on top of the base model
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

# Create the full model
model = Model(inputs=base_model.input, outputs=predictions)

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet101_weights_tf_dim_ordering_tf_kernels_notop.h5
171446536/171446536 ━━━━━━━━━━━━━━━━━━━━ 8s 0us/step


#### 5. Training and Callbacks

Training uses two callbacks, early stopping and learning rate reduction, both of which monitor the validation loss. `ReduceLROnPlateau` multiplies the learning rate by 0.2 when the validation loss has not improved for 3 epochs, down to a floor of 1e-6. The smaller learning rate lets the model make finer parameter updates, which is particularly useful when progress stalls or reaches a plateau. `EarlyStopping` halts training after 5 epochs without improvement and restores the weights from the best epoch.
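How the two callbacks interact can be traced with a small pure-Python simulation. This is a simplified sketch, not the Keras implementation: it ignores options such as `min_delta` and `cooldown`, the function name `simulate_callbacks` is hypothetical, and the validation losses are made-up values chosen to trigger both callbacks.

```python
def simulate_callbacks(val_losses, lr=1e-3, factor=0.2, lr_patience=3,
                       min_lr=1e-6, stop_patience=5):
    """Trace learning-rate reduction and early stopping over a loss sequence."""
    best_loss = float("inf")
    best_epoch = 0
    lr_wait = 0    # epochs without improvement since the last LR change
    stop_wait = 0  # epochs without improvement since the best epoch

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset both counters
            best_loss, best_epoch = loss, epoch
            lr_wait = stop_wait = 0
            continue

        lr_wait += 1
        stop_wait += 1

        if lr_wait >= lr_patience:
            # Plateau: shrink the learning rate, but never below min_lr
            lr = max(lr * factor, min_lr)
            lr_wait = 0

        if stop_wait >= stop_patience:
            # No improvement for too long: stop and report the best epoch
            return {"stopped_epoch": epoch, "best_epoch": best_epoch, "lr": lr}

    return {"stopped_epoch": None, "best_epoch": best_epoch, "lr": lr}


# Hypothetical validation losses: the best value occurs at epoch 2
hypothetical_losses = [0.64, 0.41, 0.50, 0.45, 0.42, 0.43, 0.44, 0.46]
result = simulate_callbacks(hypothetical_losses)
print(result)
```

With these made-up losses, the learning rate is cut once at epoch 5 (three epochs after the best one) and training stops at epoch 7, at which point the weights from epoch 2 would be restored.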

In [7]:
# Define learning rate scheduler
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=1e-6)

# early stopping 
callback = keras.callbacks.EarlyStopping(
    monitor='val_loss',  # Monitor validation loss
    patience=5,  # Wait for 5 epochs with no improvement
    restore_best_weights=True,  # Restore model weights to the best observed
    verbose=1  # Show messages about early stopping
)

model.fit(
    train_generator, 
    epochs=15,
    callbacks=[callback, reduce_lr],
    validation_data=validation_generator,
)


Epoch 1/15
2561/2561 ━━━━━━━━━━━━━━━━━━━━ 2867s 1s/step - accuracy: 0.8270 - loss: 0.5556 - val_accuracy: 0.7896 - val_loss: 0.6426 - learning_rate: 0.0010
Epoch 2/15
2561/2561 ━━━━━━━━━━━━━━━━━━━━ 2027s 790ms/step - accuracy: 0.9367 - loss: 0.1870 - val_accuracy: 0.8614 - val_loss: 0.4091 - learning_rate: 0.0010
Epoch 3/15
2561/2561 ━━━━━━━━━━━━━━━━━━━━ 2239s 872ms/step - accuracy: 0.9482 - loss: 0.1525 - val_accuracy: 0.8526 - val_loss: 0.5000 - learning_rate: 0.0010
Epoch 4/15
2561/2561 ━━━━━━━━━━━━━━━━━━━━ 2174s 847ms/step - accuracy: 0.9551 - loss: 0.1304 - val_accuracy: 0.8995 - val_loss: 0.3210 - learning_rate: 0.0010
Epoch 5/15
2561/2561 ━━━━━━━━━━━━━━━━━━━━ 2338s 911ms/step - accuracy: 0.9611 - loss: 0.1152 - val_accuracy: 0.8263 - val_loss: 0.6602 - learning_rate: 0.0010
Epoch 6/15
2561/2561 ━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7b63f4425ba0>


After training, the model is saved (serialized) so that it can be reused later without retraining from scratch. The cell below does this with the Keras `model.save` API, writing the model once in HDF5 format (`.h5`) and once in the native Keras format (`.keras`).

In [8]:
# Save the trained model in the legacy HDF5 format, with a timestamp in the file name
from datetime import datetime
current_datetime = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
model.save("/kaggle/working/model_" + current_datetime + ".h5")
# Save it again in the native Keras format
current_datetime = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
model.save("/kaggle/working/model_" + current_datetime + ".keras")

#### 6. Loading & Testing the Model with Test Data

The test images are passed through the trained model to generate the final predictions.

In [9]:
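test_df = pd.read_csv("/kaggle/input/bttai-nybg-2024/BTTAIxNYBG-test.csv")
test_dir = "/kaggle/input/bttai-nybg-2024/BTTAIxNYBG-test/BTTAIxNYBG-test"

# Use the rescale-only generator here: random flips, rotations, and shears
# are training-time augmentations and should not be applied at inference time
test_generator = validation_datagen.flow_from_dataframe(
    dataframe=test_df,
    directory=test_dir,
    x_col="imageFile",
    target_size=(224, 224),
    batch_size=32,
    class_mode=None,  # the test set has no labels
    shuffle=False  # keep predictions in the same order as the rows of test_df
)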


Found 30690 validated image filenames.
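The image counts reported by the three generators, together with `batch_size=32`, determine how many batches each pass takes; the 2561 steps per epoch in the training log follow from the same calculation. A short worked check (the dictionary names are ours):

```python
import math

batch_size = 32

# Image counts reported by the "Found ... validated image filenames" messages
dataset_sizes = {"train": 81946, "validation": 10244, "test": 30690}

# The last batch may be partial, so the number of batches is rounded up
steps = {name: math.ceil(count / batch_size) for name, count in dataset_sizes.items()}

print(steps)  # {'train': 2561, 'validation': 321, 'test': 960}
```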


In [10]:
predictions = model.predict(test_generator)

960/960 ━━━━━━━━━━━━━━━━━━━━ 910s 939ms/step


In [11]:
results_df = test_df.copy(deep = True)

In [12]:
# For each image, take the index of the highest softmax probability
predicted_classes = np.argmax(predictions, axis=1)
results_df['predicted_class'] = predicted_classes
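Note that `np.argmax` returns a position in the model's output vector, not a class label. The generator records how label strings map to positions in `train_generator.class_indices`, and our understanding is that the label strings are sorted lexicographically to build it. If the class IDs are the single digits 0 to 9, position and ID coincide, but that would not hold in general. The sketch below uses made-up labels and probabilities to show the safer, explicit mapping.

```python
import numpy as np

# Hypothetical label strings; note that "10" sorts before "2" as a string
labels = ["0", "1", "2", "10"]

# Mapping in the same shape as train_generator.class_indices (assumed sorted order)
class_indices = {label: position for position, label in enumerate(sorted(labels))}

# Invert it so that an output position can be turned back into a label
index_to_label = {position: label for label, position in class_indices.items()}

# Made-up softmax outputs for two images (each row sums to 1)
probabilities = np.array([
    [0.10, 0.20, 0.60, 0.10],
    [0.05, 0.05, 0.10, 0.80],
])

predicted_positions = probabilities.argmax(axis=1)
predicted_labels = [index_to_label[int(p)] for p in predicted_positions]

print(class_indices)     # {'0': 0, '1': 1, '10': 2, '2': 3}
print(predicted_labels)  # ['10', '2']
```

In the notebook itself, the equivalent step would invert `train_generator.class_indices` and apply it to `predicted_classes` before writing the submission.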

In [13]:
results_df

| Unnamed: 0 | uniqueID | imageFile | predicted_class |
| --- | --- | --- | --- |
| 0 | 1 | facd4dcd8e869617.jpg | 1 |
| 1 | 9 | 78c96bb2b2b62579.jpg | 9 |
| 2 | 10 | d292d2c4e0e6ad9d.jpg | 4 |
| 3 | 14 | 3633494929870713.jpg | 1 |
| 4 | 16 | dc94b496c8e2d6c4.jpg | 6 |
| ... | ... | ... | ... |
| 30685 | 122864 | 9ab2ba9a949abab2.jpg | 7 |
| 30686 | 122868 | ccccede8cccccc4f.jpg | 1 |
| 30687 | 122871 | 31ccec6c99ccec68.jpg | 0 |
| 30688 | 122878 | de1e0f1f0e0e9e9e.jpg | 1 |


In [14]:
submit_df = results_df[['uniqueID','predicted_class']].copy(deep=True)
submit_df.rename(columns={'predicted_class': 'classID'}, inplace=True)
submit_df.to_csv('submission' + current_datetime + '.csv', index=False)