# Stanford Dogs 🐶 - A Classfication problem
Our group decided to tackle a project that sparked everyone's interest: using the Stanford Dogs dataset to create an image classification model capable of accurately identifying dog breeds. We wanted to make it personal and engaging by trying to identify the breed of a friend's dog.

<-- *INSERT PICTURE HERE!* -->

From a business perspective, our goal is to develop an image classification model with high accuracy in identifying dog breeds. This has various practical applications, such as enhancing pet adoption platforms by providing precise breed information, aiding veterinary services with breed-specific medical advice, and enabling personalized pet care products tailored to different breeds.

We formulated the problem as a supervised learning task, leveraging the labelled images in the Stanford Dogs dataset. Our aim is to train a model that learns to distinguish the characteristics of each breed and applies this knowledge to predict the breed of new, unseen images.

To evaluate the performance of our model, we will use several metrics. ``Accuracy`` will be our primary metric, indicating the proportion of correctly identified breeds out of all predictions. We'll also assess ``precision``, ``recall``, and the ``F1-score`` to gain deeper insights into the performance for each breed, particularly if the dataset is imbalanced. Additionally, a confusion matrix will help visualize the model's performance and pinpoint misclassifications, enabling us to refine and enhance the model further.

## 02 Get the data
- Find and document where you can get the data from
- Get the data
- Check the size and type of data (time series, geographical etc)

### 00 Getting the data 🗃️

In terms for getting the actual data, we found the dataset itself on Kaggle, a platform for data scientists and machine learning enthusiasts. It hosts a vast array of datasets, including the [Stanford Dogs dataset](https://www.kaggle.com/datasets/jessicali9530/stanford-dogs-dataset/code), which is a widely used benchmark for image classification tasks. 

The dataset consits of over 20,000 images of dogs, encompassing 120 different breeds, making it an ideal resourcs for training and evaluation our model.

**NOTE**: It requires an Kaggle account to download the dataset from their website

## 03 Exploring the Data
- Create a copy of the data for explorations (sampling it down to a manageable size, if necessary)
- Create a Jupyter notebook to keep a record of ytour data exploration
- Study each feature and its characteristics:
    - Name
    - Type
    - Percentage of missing values
    - Check for outliers, rounding errors etc.
- For supervised learning tasks, identify the target(s)
- Visualize the data
- Study the correlation between features
- Identify the promising transformations you may want to apploy (e.g. convert skewed targets to normal via. a `log` transformation)
- Document what you have learned

### 01 Exploring the dataset 🔍

Now that we have the dataset in our possion, we can take a closer look at it. The dataset itself consists more directly of **images** and **annotations**.

The **images** are the actually images of the different breeds stored as `.jpg`-files.
The **annotations** seems to be some kind of `.xml`-files, which contains information about where the dogs are located in the images and what breed the dog is.


In [6]:
# TODO: Look at the first image in the dataset and print out the dimensions of the image and the number of channels

# TODO: Look at the first annotation in the dataset and print out the number of objects in the annotation

# TODO: Make a histogram of the total number of images for each dog breed in the dataset. (This will help us determine if we need to look at F1 score instead of accuracy)

# ANY OTHER IDEAS?


## 04 Prepare the data
Notes:
- Work on copies of the data (keep the original dataset intact)
- Write functions for all data transformation you apply, for three reasons:
    - So you can easily prepare the data next time you run your code
    - So you can apply these transformations in future projects
    - To clean and prepare the test set.


1. Data cleaning:
    - Fix or remove outliers (or keep them)
    - Fill in the missing values (e.g. with zero, mean, median, regression....) or drop their rows (or columns)
2. Feature selection (optional):
    - Drop the features that provide no useful information for the task (e.g. a customer ID is usually useless for modelling).
3. Feature engineering, where appropriate:
    - Discretize continuos features
    - Use one-hot encoding if/when relevant
    * Add promising transformations of features (e.g. $\log(x)$, $\sqrt{x}$, $x^2$, etc)
    * Aggregate features into promising new features
4. Feature scaling: standardise or normalise features

### 02 Preparing the data for training 🛠️

<-- *Write something here* -->

In [8]:
# TODO: Include the script to rename all the folders? (I think this is a good idea)

# TODO: ANY OTHER IDEAS?

<-- *Explanation for the below approach* -->

In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import os
import scipy

# Define paths
images_dir = 'images'

# Create the ImageDataGenerator data generator
datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2, # 20% of the data will be used for validation
    horizontal_flip=True,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.2,
)

# Load all images to be used for the training set.
train_generator = datagen.flow_from_directory(
    images_dir,
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='training',
    shuffle=True,
    seed=42
)

# Load all images to be used for the validation set.
validation_generator = datagen.flow_from_directory(
    images_dir,
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='validation',
    shuffle=True,
    seed=42
)



Found 16508 images belonging to 120 classes.
Found 4072 images belonging to 120 classes.


####  Compiling the model ⚙️

The next step in the process is to compile the model itself. But before that we have define what **Loss function**, **Optimizer** and **Metrics** we are going to be using on this model.

For the **Loss function** We have a few different options:

(*Name a few different loss functions that would make sense to use for this project.*)

For the **Optizimers** we also have a few different options:
- *Adam*, *SGD*, *RMSProp* etc.

For the **Metrcis** we also have a few different options:
- *Accuarcy*, *PRecision*, *Recall*, *F1 score* etc.


In [2]:
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.regularizers import l2

# ? Load pre-trained model, if available
if os.path.exists('model.h5'):
    model = tf.keras.models.load_model('model.h5')

# ? Otherwise, we need to create a new instance of the model.
else:
    # Load the ResNet50 model, pre-trained on ImageNet
    base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

    # Add custom layers on top of the base model
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dense(1024, activation='relu', kernel_regularizer=l2(0.01))(x)
    x = Dropout(0.5)(x)
    predictions = Dense(train_generator.num_classes, activation='softmax')(x)

    # Define the model
    model = Model(inputs=base_model.input, outputs=predictions)

    # Unfreeze the last few layers of the base model
    for layer in base_model.layers[-10:]:
        layer.trainable = True

    # Compile the model
    model.compile(optimizer=SGD(learning_rate=0.0001, momentum=0.9), loss='categorical_crossentropy', metrics=['accuracy'])


#### Training the model 🏋️‍♀️

The next step in the process is to train the now compiled model on our data. Here we also have a little exploratory work in figuring out:
- What *batch size* should we use?
- What *number of epochs* should we use?
- Is the model *overfitting* or *underfitting*?



In [4]:
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint

# - Function the limit the number of batches per epoch for faster iterations.
def limit_batches(generator, max_batches):
    while True:
        for i, (x_batch, y_batch) in enumerate(generator):
            if i >= max_batches:
                break
            yield (x_batch, y_batch)

# * Current limits:
max_train_batches = 100 # It's a good starting point, but needs to be adjusted for better results.
max_validation_batches = 25 # It's a good starting point.

# ? Callbacks and their usage

# 1. Reduce learning rate when a metric has stopped improving.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=0.00001, verbose=1)
# 2. Stop training when a monitored quantity has stopped improving.
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True, verbose=1)
# 3. Save the model after every epoch.
model_checkpoint = ModelCheckpoint('model.h5', save_best_only=True, save_weights_only=True, monitor='val_loss', mode='min', verbose=1)

# ! 1st round of training
history = model.fit(
    limit_batches(train_generator, max_train_batches),
    validation_data=limit_batches(validation_generator, max_validation_batches),
    epochs=20, # Use a small number of epochs to speed up the process (10 epochs = 5 mins on GPU - With validation accuracy of 0.18 after 10 epochs)
    steps_per_epoch=max_train_batches,
    validation_steps=max_validation_batches
)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [5]:
val_loss, val_accuracy = model.evaluate(validation_generator, steps=validation_generator.samples // validation_generator.batch_size)
print(f'Validation accuracy: {val_accuracy * 100:.2f}%')

# Save the model
model.save('model.h5')

Validation accuracy: 72.05%


## 05 Short-list promising models
We expect you to do some additional research and train at **least** one model per team member!

1. Train mainly quick and dirty models from different categories (e.g. linear, SVM, Random Forests etc.) using default parameters
2. Measure and compare their performance
3. Analyse the most significant variables from each algorithm.
4. Analyse the types of errors the models make
5. Have a quick round of feature selection and engineering if necessary
6. Have one or two more quick iterations of the five previous steps.
7. Short-list the top three to five most promising models, prefeering models that make *different* types of errors?

## 06 Fine-tune the system

1. Fine-tune the hyperparameters
2. Once you are confident about our final model, measure its performance on test set to estimate the generalisation error.


## 07 Present your solution

1. Document what you ahve done
2. Create a *nice* 15 minute video presentation with slides
    - Make sure you highlight the big picture first.
3. Explain why your solutions achieves the business objective
4. Don't forget to present interesting points you noticed along the way:
    - Describe what worked and what did not.
    - List your assumptions and your model's limitations.
5. Ensure your key finds are communicated through nice visualisations or easy-to-remember statements (e.g. "*The median income is the number-one predictor of housing prices*")
6. Upload the presentation to some online platform, e.g. Youtube, Vimeo, and supply a link to the video in the notebook.