# **Multiclass dog breed classification**

This notebook builds an end-to-end image classifier using TensorFlow and TensorFlow Hub.

## 1. Problem

Creating a machine learning model that recognizes and classifies dog breeds given an image of a dog.

## 2. Data

The data comes from Kaggle's 'Dog Breed Identification' competition: https://www.kaggle.com/competitions/dog-breed-identification/data.

It consists of a training set (10222 samples with labels) and a test set (10.4k samples) of images of dogs. Each image has a filename that is its unique id. The dataset comprises 120 breeds of dogs.

## 3. Evaluation

Model performance is evaluated on a submission file that contains, for every test image, the predicted probability of each breed.
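As a rough illustration of how such probability predictions can be scored, here is a minimal sketch of multi-class log loss in NumPy. It assumes this is the metric Kaggle applies to this competition. The function name `multiclass_log_loss` and the toy arrays are made up for this example and are not part of the notebook.

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: one-hot array of shape (n_samples, n_classes)
    y_prob: predicted probabilities with the same shape
    """
    # Clip to avoid log(0), then renormalize each row so it sums to 1
    y_prob = np.clip(y_prob, eps, 1 - eps)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# Toy example with 3 classes (the real task has 120)
y_true = np.array([[1, 0, 0], [0, 1, 0]])
confident = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
uniform = np.full((2, 3), 1 / 3)

print(multiclass_log_loss(y_true, confident))  # lower is better
print(multiclass_log_loss(y_true, uniform))    # equals log(3) for a uniform guess
```

A lower value means the model puts more probability on the correct breed. A uniform guess over 120 breeds would score log(120).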

## Workspace preparation

In [None]:
#importing the necessary libraries
import tensorflow as tf
import tensorflow_hub as hub
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Getting the data ready

In [None]:
#Importing the csv file that contains the labels of the samples in the training set
labels_csv = pd.read_csv('/content/drive/MyDrive/Dog breed classification project/labels.csv')
print(labels_csv.head())

In [None]:
#we have 120 breeds and 10222 unique samples
print(labels_csv.describe())
#We have no missing data
print(labels_csv.info())

In [None]:
#Number of samples of each breed
labels_csv['breed'].value_counts().plot(kind='bar', figsize=(18,8))
plt.title("Number of samples for each breed");

In [None]:
#Median number of samples per breed
labels_csv['breed'].value_counts().median()

In [None]:
#Visualizing one of the sample images
from IPython.display import Image
Image('/content/drive/MyDrive/Dog breed classification project/train/0021f9ceb3235effd7fcde7f7538ed62.jpg')

In [None]:
#Creating a list of the paths to each of the images
filenames = ['/content/drive/MyDrive/Dog breed classification project/train/' + names for names in labels_csv['id'] +'.jpg']
#First 5 elements of the list
filenames[:5]

In [None]:
#Verifying that the number of samples in the train folder is the same as in labels_csv
import os
len(os.listdir('/content/drive/MyDrive/Dog breed classification project/train')) == len(labels_csv)

## Preparing the labels

In [None]:
#turning the labels into an array
labels = labels_csv['breed'].to_numpy()
labels

In [None]:
#Find the unique label values
unique_breeds = np.unique(labels)

In [None]:
#Turn every label into a boolean array
boolean_array = np.array([label == unique_breeds for label in labels])
boolean_array

In [None]:
boolean_array.shape

In [None]:
print(labels[0])
#code to find the index where a label is found in unique_breeds
print(np.where(labels[0]==unique_breeds))
#Index where label occurs in boolean_array
print(boolean_array[0].argmax())
#How to one hot encode a boolean_array
boolean_array[0].astype(int)

In [None]:
encoded_labels = boolean_array.astype(int)
encoded_labels[0]
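The same encoding can be checked on a small scale. The sketch below uses made-up labels (`toy_labels` is hypothetical, standing in for `labels_csv['breed']`) and does the label-versus-class comparison in one broadcast step instead of a list comprehension.

```python
import numpy as np

# Hypothetical toy labels standing in for the breed column
toy_labels = np.array(["pug", "beagle", "pug", "collie"])
toy_classes = np.unique(toy_labels)  # sorted: beagle, collie, pug

# Broadcasting compares every label against every class at once
toy_one_hot = (toy_labels[:, None] == toy_classes).astype(int)

print(toy_classes)
print(toy_one_hot)

# argmax recovers the class index, so the encoding is reversible
recovered = toy_classes[toy_one_hot.argmax(axis=1)]
print(recovered)
```

Each row has a single 1, in the column of that label's class. That is why `argmax` is enough to turn a row back into a breed name.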

## Setting up the training and validation sets

In [None]:
X = filenames
y = encoded_labels

We'll start with 1000 images and then use the full dataset once we have evaluated model performance.

In [None]:
#Set the number of images for experimenting
NUM_IMAGES = 1000 # @param {type:"slider", min:1000, max:10000, step:1000}

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X[:NUM_IMAGES], y[:NUM_IMAGES], test_size=0.2, random_state=1)

len(X_train), len(X_val), len(y_train), len(y_val)

## Preprocessing images: turning images into tensors

We will create a function that:
1. Takes an image filepath as input.
2. Uses TensorFlow to read the file and save it as a variable.
3. Decodes the image (JPEG) into a tensor.
4. Normalizes the image.
5. Resizes the image to a shape of (224,224). This is because the transfer learning model was trained with that shape.
6. Returns the modified image.
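To make the normalization step concrete, here is a small NumPy sketch. It assumes that normalizing means rescaling 8-bit channel values from the range 0 to 255 into floats between 0 and 1. `toy_image` is a made-up array, not one of the dataset images.

```python
import numpy as np

# Hypothetical 2x2 RGB image with 8-bit channel values
toy_image = np.array([[[0, 128, 255], [255, 255, 255]],
                      [[64, 64, 64], [0, 0, 0]]], dtype=np.uint8)

# Scale to floats in [0, 1] by dividing by the largest 8-bit value
toy_normalized = toy_image.astype(np.float32) / 255.0

print(toy_normalized.dtype, toy_normalized.min(), toy_normalized.max())
```

The shape is unchanged. Only the data type and the value range differ.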

In [None]:
#Converting an image to a numpy array
image = plt.imread(filenames[4])
image.shape

In [None]:
#an image consists of red, green and blue pixel values between 0 and 255
image.max(), image.min()

In [None]:
#Transforming the image into a tensor
t_image = tf.constant(image)
t_image[0]

### Preprocessing function

In [None]:
IMG_SIZE = 224
def preprocess_image(filepath, size=IMG_SIZE):

  #reading the image as a tensor of type "string"
  image = tf.io.read_file(filepath)
  #turn the jpeg image into tensors with three color channels (red, green, blue)
  image = tf.image.decode_jpeg(image,channels=3)
  #Normalizing the values
  image = tf.image.convert_image_dtype(image, tf.float32)
  #Resizing the image (uses the size argument instead of the global IMG_SIZE)
  image = tf.image.resize(image,size=[size,size])

  return image

In [None]:
#preprocess example
preprocess_image(filenames[1]).shape

## Turning the data into batches

In [None]:
#Creating a function that returns a tuple of (preprocessed_image, label)
def get_image_label(filepath,label):
  image = preprocess_image(filepath)
  return image, label

In [None]:
#Defining the batch size
BATCH_SIZE = 32
#Create a function to create data into batches
def create_data_batches(X,y=None, batch_size= BATCH_SIZE, valid_data=False,test_data=False):
  '''
  Create batches of data out of image (X) and label (y) pairs.
  It shuffles the data if it is the training data, and doesn't shuffle if
  it's validation data. It also accepts test data as input (no labels).
  '''
  #If the data is the test dataset, we likely won't have labels:
  if test_data:
    print('Creating test data batches...')
    data = tf.data.Dataset.from_tensor_slices(tf.constant(X)) #only filepaths, no labels
    data_batch = data.map(preprocess_image).batch(batch_size)
    print('batch created')
    return data_batch

  #If the data is a validation dataset, we don't need to shuffle it
  if valid_data:
    print('Creating validation data batches...')
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X), #filepath
                                               tf.constant(y))) #labels
    #This time we use the get_image_label function because we are also working with labels
    data_batch = data.map(get_image_label).batch(batch_size)
    print('batch created')
    return data_batch

  else:
    print('Creating training data batches...')
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X), tf.constant(y)))
    #shuffling the data before mapping
    #The shuffle is done before preprocessing because it is more computationally efficient this way
    data = data.shuffle(buffer_size=len(X))
    data_batch = data.map(get_image_label).batch(batch_size)
    print('batch created')
    return data_batch


In [None]:
train_data = create_data_batches(X_train,y_train)
val_data = create_data_batches(X_val,y_val, valid_data=True)
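The shuffle-then-batch idea can be sketched without TensorFlow. This is a simplified, framework-free illustration, not how `tf.data` is implemented. The helper `toy_batches` and the file names are hypothetical, and unlike `tf.data` it does not reshuffle on every epoch.

```python
import numpy as np

def toy_batches(items, batch_size=32, shuffle=True, seed=0):
    """Yield lists of items in batches, optionally shuffling the order first."""
    order = np.arange(len(items))
    if shuffle:
        # Shuffle cheap indices (or file paths) before any expensive image loading
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, len(items), batch_size):
        yield [items[i] for i in order[start:start + batch_size]]

# 800 hypothetical file paths, matching an 80% split of 1000 images
toy_paths = [f"img_{i}.jpg" for i in range(800)]
toy_batch_list = list(toy_batches(toy_paths))

print(len(toy_batch_list), len(toy_batch_list[0]))
```

With 800 training paths and a batch size of 32, this gives 25 full batches.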

In [None]:
#Now our data is in a dataset
train_data

## Visualizing data batches

In [None]:
#Creating a function for viewing images in a data batch
def show_images(images, labels):
  '''
  Displays a plot of 25 images and their labels from a data batch
  '''
  #Setup the figure
  plt.figure(figsize=(12,12))
  #Loop through 25
  for i in range(25):
    #Create subplots
    ax = plt.subplot(5,5,i+1)
    #Display image
    plt.imshow(images[i])
    #Add the label as the image title
    plt.title(unique_breeds[labels[i].argmax()])
    #Turning grid lines off
    plt.axis('off')

In [None]:
#Unbatching the images
train_images, train_labels = next(train_data.as_numpy_iterator()) #next takes the top batch of the iterator

In [None]:
len(train_images), len(train_labels)

In [None]:
#Using the visualization function on the train dataset
show_images(train_images, train_labels)

## Building a model

Before building the model, we need to define:
* The input shape (our images shape, in the form of Tensors) of our model.
* The output shape (image labels, in the form of Tensors) of our model.
* The URL of the model we want to use from TensorFlow Hub - https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/4

In [None]:
IMG_SIZE

In [None]:
#Setting the input shape
INPUT_SHAPE = [None, IMG_SIZE,IMG_SIZE,3]
#Setting the output shape
OUTPUT_SHAPE = len(unique_breeds)
#Model URL
MODEL_URL = "https://www.kaggle.com/models/google/mobilenet-v2/frameworks/TensorFlow2/variations/140-224-classification/versions/2"
#first one: "https://kaggle.com/models/google/mobilenet-v2/frameworks/TensorFlow2/variations/130-224-classification/versions/1"

Creating a function that:
* Takes the input and output shapes and the model as parameters.
* Defines the layers in a Keras model sequentially
* Compiles the model
* Builds the model
* Returns the model

Documentation: https://www.tensorflow.org/guide/keras

In [None]:
#Creating a function that creates a Keras model

def create_model(model_url=MODEL_URL, input_shape=INPUT_SHAPE,output_shape=OUTPUT_SHAPE):

  #Setup the model layers
  model = tf.keras.Sequential([
      hub.KerasLayer(model_url), #Input Layer
      tf.keras.layers.Dense(units=1000, activation ='relu'), #Hidden layer
      tf.keras.layers.Dropout(0.5),
      tf.keras.layers.Dense(units=output_shape,activation='softmax') #Output layer
                              ])

  #Compiling the model
  model.compile(
      loss=tf.keras.losses.CategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.Adam(),
      metrics= ['accuracy']
  )
  #Building the model
  model.build(input_shape)

  return model

In [None]:
model = create_model()

In [None]:
model.summary()

## Creating callbacks

Callbacks are helper functions that a model uses during training to save/check its progress or stop training early if the model stops improving.

We'll create two callbacks, one for TensorBoard, which helps track our model's progress, and another for early stopping, which prevents our model from training for too long.

### TensorBoard callback
Four things are needed:
1. Load the TensorBoard notebook extension
2. Create a TensorBoard callback that can save logs to a directory
3. Pass the TensorBoard callback to the model's fit function
4. Visualize the model's training log with the %tensorboard magic function

In [None]:
#Load tensorboard notebook
%load_ext tensorboard

In [None]:
import datetime

#Create a function to build a tensorboard callback
def create_tensorboard_callback():
  #Create a log directory for storing tensorboard logs
  logdir = os.path.join('/content/drive/MyDrive/Dog breed classification project/logs',
                        #make it so that the logs get tracked whenever we train the model
                        datetime.datetime.now().strftime("%d%m%Y-%H%M%S"))
  return tf.keras.callbacks.TensorBoard(log_dir=logdir)

### Creating an early stopping callback

This callback stops the model from overfitting by stopping training if a certain evaluation metric stops improving.

In [None]:
#Creating an early stopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=2)
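The patience rule can be illustrated with a plain-Python sketch. It assumes training stops once the monitored metric has failed to beat its best value for `patience` consecutive epochs. This is a simplified model of the idea, not the Keras implementation. The function `early_stopping_epoch` and the accuracy values are made up.

```python
def early_stopping_epoch(history, patience=2):
    """Return the 1-based epoch at which training would stop.

    history: metric values per epoch, where higher is better (e.g. val_accuracy)
    """
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(history, start=1):
        if value > best:
            # Improvement: remember the new best and reset the counter
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # The metric kept improving often enough, so all epochs run
    return len(history)

toy_val_accuracy = [0.40, 0.55, 0.61, 0.60, 0.61, 0.59]
print(early_stopping_epoch(toy_val_accuracy, patience=2))
```

In this toy run the best value is reached at epoch 3. Epochs 4 and 5 do not exceed it, so training stops after epoch 5.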

## Training the model

The first model will only train on 1000 images to make sure that everything is working properly.

In [None]:
#Number of epochs
EPOCHS = 100 #@param {type:'slider', min:10, max:100, step:10}

In [None]:
#Checking to make sure the GPU is available
tf.config.list_physical_devices()

Creating a function that trains the model

* Creates a model with the create_model function
* Creates a tensorboard callback with create_tensorboard_callback
* Calls the fit function of the model
* Returns the model

In [None]:
#Function to create, train, and return a trained model

def train_model():
  #Create model
  model = create_model()

  #Create a new tensorboard callback every time we train a model
  tensorboard = create_tensorboard_callback()

  #Train the model
  model.fit(x=train_data,
            epochs= EPOCHS,
            validation_data=val_data,
            validation_freq=1,
            callbacks= [tensorboard,early_stopping])
  return model

In [None]:
#Fit model to the data
model = train_model()

## Making and evaluating predictions with a trained model

In [None]:
predictions = model.predict(val_data, verbose=1)

In [None]:
#Each prediction is an array with 120 values, one for the probability of each class
predictions[0]

In [None]:
#Function for getting the labels out of a prediction
def get_prediction_label(prediction):
  '''
  Turns an array of prediction probabilities into a label
  '''
  return unique_breeds[np.argmax(prediction)]

#example
get_prediction_label(predictions[0])

In [None]:
#Create a function to unbatch a batched dataset
def unbatch_dataset(dataset):
  '''
  Takes a batched dataset, unbatches it,
  and returns lists of the images and labels
  '''
  ub_images = []
  ub_labels = []
  for image, label in dataset.unbatch().as_numpy_iterator():
    ub_images.append(image)
    ub_labels.append(label)
  return ub_images, ub_labels

In [None]:
val_images, val_labels = unbatch_dataset(val_data)

We'll create a function that plots the predicted label, its predicted probability, and the target image on a single plot.

In [None]:
def plot_prediction(pred_prob, true_label, image, n=0):

  #pred label
  pred_label = get_prediction_label(pred_prob[n])
  actual_label = unique_breeds[true_label[n].argmax()]

  #Color title depending on prediction correctness
  if pred_label == actual_label:
    color = 'green'
  else:
    color = 'red'

  #plot image and remove axis ticks
  plt.imshow(image[n])
  plt.xticks([])
  plt.yticks([])
  plt.title(f"True breed: {actual_label},  predicted breed: {pred_label},  probability: {np.max(pred_prob[n])*100:.2f}%", color=color, fontsize=10)


In [None]:
#The prediction is correct, even though the probability is low
#This means the model is not too sure
plot_prediction(predictions,val_labels, val_images, n=1)

In [None]:
#Creating a function that plots the top ten prediction confidences for a single prediction
def plot_pred_conf(predictions, labels, n=1):
  pred_prob, true_label = predictions[n], labels[n]

  #The predicted label
  pred_label = get_prediction_label(pred_prob)

  #top ten prediction confidence indexes
  top_ten_indexes = pred_prob.argsort()[-10:][::-1] #[-10:] gets the last ten and [::-1] orders them in descending order
  #top ten prediction confidence values
  top_ten_values = pred_prob[top_ten_indexes]
  #top ten prediction labels
  top_ten_labels = unique_breeds[top_ten_indexes]

  #setup plot
  top_plot = plt.bar(np.arange(len(top_ten_labels)),top_ten_values, color='grey')
  plt.xticks(np.arange(len(top_ten_labels)), labels=top_ten_labels, rotation='vertical')

  #Change the color of the true label
  if np.isin(unique_breeds[true_label.argmax()], top_ten_labels):
    top_plot[np.argmax(top_ten_labels == unique_breeds[true_label.argmax()])].set_color("green")
  else:
    pass

In [None]:
#Breed probability values for the previous image
plot_pred_conf(predictions, val_labels, n=1)
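The `argsort()[-10:][::-1]` idiom used in `plot_pred_conf` is easier to follow on a small array. The sketch below uses made-up probabilities and class names, and takes the top three instead of the top ten.

```python
import numpy as np

# Hypothetical probabilities for five classes
toy_probs = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
toy_names = np.array(["a", "b", "c", "d", "e"])
k = 3

# argsort sorts ascending, so the last k indexes are the largest values;
# [::-1] reverses them into descending order
top_k_idx = toy_probs.argsort()[-k:][::-1]

print(top_k_idx)             # indexes of the k most likely classes
print(toy_probs[top_k_idx])  # their probabilities, highest first
print(toy_names[top_k_idx])  # their names
```

Indexing both the probability array and the name array with the same indexes keeps the values and labels aligned.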

In [None]:
np.isin(5,np.array([0,1,2,3,4,5]))

In [None]:
#plotting the images along with the top ten predictions
i_multiplier = 30
num_rows = 3
num_columns = 2
num_images = num_rows*num_columns

plt.figure(figsize=(5*2*num_columns, 5*num_rows))

for i in range(num_images):
  plt.subplot(num_rows,2*num_columns, 2*i+1)
  plot_prediction(predictions,val_labels,val_images, n=i+i_multiplier)
  plt.subplot(num_rows,2*num_columns, 2*i+2)
  plot_pred_conf(predictions,val_labels,n=i+i_multiplier)

plt.tight_layout(h_pad=1.0)

Create a confusion matrix with this model
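One possible way to do this is sketched below with NumPy only. It assumes one-hot true labels and an array of predicted probabilities, as used elsewhere in this notebook. The function name `confusion_matrix_from_one_hot` and the toy arrays are hypothetical.

```python
import numpy as np

def confusion_matrix_from_one_hot(y_true, y_prob):
    """Rows are true classes, columns are predicted classes."""
    n_classes = y_true.shape[1]
    true_idx = np.argmax(y_true, axis=1)
    pred_idx = np.argmax(y_prob, axis=1)
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at counts repeated (true, predicted) pairs correctly
    np.add.at(matrix, (true_idx, pred_idx), 1)
    return matrix

# Toy example: 4 samples, 3 classes, one class-2 sample predicted as class 1
toy_true = np.eye(3, dtype=int)[[0, 1, 2, 2]]
toy_prob = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.2, 0.5, 0.3],
                     [0.1, 0.1, 0.8]])

toy_cm = confusion_matrix_from_one_hot(toy_true, toy_prob)
print(toy_cm)
```

With the notebook's variables, the call would presumably look like `confusion_matrix_from_one_hot(np.array(val_labels), predictions)`. With 120 breeds the matrix is large, so a heatmap is probably easier to read than the raw numbers.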

## Functions for saving and loading the model

In [None]:
def save_model(model, suffix=None):
  """
  Saves a given model in a models directory and appends a suffix (str)
  for clarity and reuse.
  """
  # Create model directory with current time
  # (%S is zero-padded seconds; lowercase %s is a non-portable epoch timestamp)
  modeldir = os.path.join("/content/drive/MyDrive/Dog breed classification project/models",
                          datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  # Only append the suffix when one is given, so suffix=None does not raise a TypeError
  model_path = modeldir + (f"-{suffix}" if suffix else "") + ".h5" # save format of model
  print(f"Saving model to: {model_path}...")
  model.save(model_path)
  return model_path

In [None]:
def load_model(model_path):
  """
  Loads a saved model from a specified path.
  """
  print(f"Loading saved model from: {model_path}")
  model = tf.keras.models.load_model(model_path,
                                     custom_objects={"KerasLayer":hub.KerasLayer})
  return model

## Training a model on the full data

In [None]:
len(X), len(y)

In [None]:
# Creating batches of the full data
full_data = create_data_batches(X,y)

In [None]:
#Instantiating a new model
full_model = create_model()

In [None]:
#Creating the full model callbacks
#tensorboard
full_model_tensorboard = create_tensorboard_callback()
#early stopping
full_early_stop = tf.keras.callbacks.EarlyStopping(monitor='accuracy', patience=3)

In [None]:
#Number of epochs
NUM_EPOCHS = 100 #@param {type:"slider", min:10, max:100, step:10}

In [None]:
#Training the full data model
full_model.fit(x=full_data,
               epochs=NUM_EPOCHS,
               callbacks=[full_model_tensorboard, full_early_stop])

In [None]:
#saving the full model
save_model(full_model,'full_dataset')

In [None]:
#loading the full model
full_model = load_model('/content/drive/MyDrive/Dog breed classification project/models/20240325-23381711409907-full_dataset.h5')

In [None]:
full_model.summary()

## Making predictions on the test set

The test data has to be put into batches because the model was trained on batched data and expects the same input format.

In [None]:
#Load test image filenames
test_path = '/content/drive/MyDrive/Dog breed classification project/test/'
test_filenames = [test_path + fname for fname in os.listdir(test_path)]
test_filenames[:5]

In [None]:
#Creating data batches of the test set
test_data = create_data_batches(test_filenames, test_data = True)

In [None]:
test_data

In [None]:
test_predictions = full_model.predict(test_data, verbose=1)

In [None]:
#save predictions
np.savetxt('/content/drive/MyDrive/Dog breed classification project/test-predictions.csv', test_predictions, delimiter=',')

In [None]:
#load predictions
test_predictions = np.loadtxt('/content/drive/MyDrive/Dog breed classification project/test-predictions.csv', delimiter=',')

## Preparing prediction set for Kaggle

We must create a csv with the id of the image and the predicted probability for each unique breed.
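Before writing the file, it can be worth checking the prediction array. This is a minimal sketch, assuming each row should hold one probability per breed and sum to roughly 1. The helper `check_submission_probs` is hypothetical and not part of Kaggle's tooling.

```python
import numpy as np

def check_submission_probs(probs, n_classes, tol=1e-4):
    """Return True if probs looks like valid per-class probabilities."""
    probs = np.asarray(probs)
    # One column per class
    ok_shape = probs.ndim == 2 and probs.shape[1] == n_classes
    # Every value within [0, 1]
    ok_range = bool(ok_shape and ((probs >= 0) & (probs <= 1)).all())
    # Every row sums to approximately 1
    ok_sum = bool(ok_shape and np.allclose(probs.sum(axis=1), 1.0, atol=tol))
    return ok_shape and ok_range and ok_sum

# Toy arrays with 3 classes (the real submission has 120 columns)
good = np.array([[0.2, 0.3, 0.5], [0.9, 0.05, 0.05]])
bad = np.array([[0.2, 0.3, 0.9], [0.9, 0.05, 0.05]])

print(check_submission_probs(good, n_classes=3))
print(check_submission_probs(bad, n_classes=3))
```

In the notebook, the equivalent call would presumably be `check_submission_probs(test_predictions, len(unique_breeds))`.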

In [None]:
#Setting the columns of the dataframe
df = pd.DataFrame(columns=['id']+list(unique_breeds))

In [None]:
#Append test image IDs to the prediction DataFrame
#Derived from test_filenames so the IDs stay in the same order as the predictions
test_ids = [os.path.splitext(os.path.basename(path))[0] for path in test_filenames]
df['id'] = test_ids

In [None]:
#Add the prediction probabilities to each dog breed column
df[list(unique_breeds)] = test_predictions

In [None]:
df

In [None]:
#Saving the prediction
df.to_csv('/content/drive/MyDrive/Dog breed classification project/full_model_predictions.csv', index=False)

## Making predictions on custom images

The steps are the same as those for the test set.
Steps:
* Get the paths
* Create batches with the images
* Input the batch to the predict function of the model
* Convert the prediction to labels

In [None]:
#creating custom filepaths
custom_path = '/content/drive/MyDrive/Dog breed classification project/Custom images/'
custom_filepaths = [custom_path + fnames for fnames in os.listdir(custom_path)]
custom_filepaths

In [None]:
#inputting the filepath directly
custom_data = create_data_batches(custom_filepaths, test_data = True)

In [None]:
custom_predictions = full_model.predict(custom_data)

In [None]:
get_prediction_label(custom_predictions[0])

In [None]:
custom_pred_labels = [get_prediction_label(custom_predictions[i]) for i in range(len(custom_predictions))]

In [None]:
custom_pred_labels

In [None]:
#loop through unbatched data
custom_images = []
for image in custom_data.unbatch().as_numpy_iterator():
  custom_images.append(image)

custom_images

In [None]:
plt.figure(figsize=(10,10))
plt.imshow(custom_images[0])
plt.title(f'{np.max(custom_predictions[0])*100:.2f}% {custom_pred_labels[0]}')
plt.xticks([])
plt.yticks([]);