# Dog Breed Identification
This notebook uses transfer learning to build a multi-class image classifier with TensorFlow 2.x and TensorFlow Hub.

## 1. Problem

We are provided with a training set and a test set of images of dogs. Each image has a filename that is its unique id. The dataset comprises 120 breeds of dogs. The goal is to create a classifier capable of determining a dog's breed from a photo.

## 2. Data

The data we're using is from Kaggle's dog breed identification competition.

https://www.kaggle.com/c/dog-breed-identification/data 

## 3. Evaluation

Submissions are evaluated on a file containing a predicted probability for every dog breed for each test image.

https://www.kaggle.com/c/dog-breed-identification/overview/evaluation
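
As a rough illustration of how a file of per-breed probabilities could be scored, the sketch below assumes the metric is multi-class log loss (the evaluation page linked above is the authoritative description). The function name `multiclass_log_loss` and the toy arrays are made up for this example and are not part of the competition code.

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class (assumed metric)."""
    # Clip to avoid log(0), then renormalise so each row still sums to 1
    y_prob = np.clip(y_prob, eps, 1 - eps)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)
    # Pick out the probability predicted for the true class of each sample
    true_class_prob = y_prob[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(true_class_prob)))

# Two toy samples, three toy classes
y_true = np.array([0, 2])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]])
loss = multiclass_log_loss(y_true, y_prob)
print(round(loss, 4))
```

Under this assumption, lower is better, and a confident wrong prediction is penalised much more heavily than an uncertain one.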

## 4. Features
* There are 120 breeds of dogs (this means there are 120 different classes).
* There are 10,222 images in the training set (these images have labels).
* There are a little over 10,000 images in the test set (these images have no labels, because we'll want to predict them).

### Getting the Data and Importing the Libraries
We will start off by getting the data from Kaggle using the Kaggle API, but first we force-reinstall the `kaggle` package with pip to prevent errors.

In [None]:
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.utils import to_categorical  # use the Keras bundled with TensorFlow

print(tf.__version__)  # Make sure TensorFlow 2.x is imported

In [None]:
# Adding the username and key from your Kaggle API token (kaggle.json).
# Use your own values here; never share a real API key in a notebook.
os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"

In [None]:
# Downloading the Dataset from Kaggle
!kaggle competitions download -c dog-breed-identification

In [None]:
# Unzipping the archive
!unzip dog-breed-identification.zip

In [None]:
# Importing the labels.csv to a pandas dataframe
df = pd.read_csv("./labels.csv")
df.head(10)

In [None]:
df.describe()

In [None]:
# Viewing the data distribution (number of images per breed)
df["breed"].value_counts().plot.bar(figsize=(20,10))

### Preprocessing the Data

In [None]:
# Getting the list of training filenames
filenames = ['/content/train/' + fname + '.jpg' for fname in df["id"]]
print(filenames[:10])
if len(os.listdir('/content/train')) == len(filenames):
  print("Number of files matches!")
else:
  print("Number of files doesn't match!")

In [None]:
# Getting the list of labels and the (sorted) unique breeds
labels = list(df["breed"])
unique_breeds = np.unique(labels)
print("Labels:", labels[:10], "\n")
print("Unique Breeds:", unique_breeds[:10])

> One Hot Encoding the Labels

In [None]:
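# Map each breed name to an integer index.
# LabelEncoder sorts the classes, so the indices line up with unique_breeds.
lbl = LabelEncoder()
labels = lbl.fit_transform(labels)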

In [None]:
labels = to_categorical(labels)
print("Number of unique labels:",len(labels[0]))
labels[0]
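
To make the encoding step concrete, here is a minimal NumPy-only sketch of the same idea on made-up labels. It relies on the fact that both `LabelEncoder` and `np.unique` sort the class names, which is why an `argmax` over a one-hot row can later be used to index into `unique_breeds`. The names `toy_labels`, `classes`, `one_hot` and `decoded` are illustrative only.

```python
import numpy as np

toy_labels = ["pug", "beagle", "pug", "collie"]

# np.unique returns the sorted class names and, for each label, its index into them
classes, indices = np.unique(toy_labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(classes), dtype=np.float32)[indices]

print(classes)
print(one_hot)

# argmax recovers the class index, which maps back to the breed name
decoded = classes[one_hot.argmax(axis=1)]
print(decoded)
```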

In [None]:
X = filenames
y = labels
len(X)

> Splitting the data into Training and Validation

In [None]:
NUM_IMAGES = 1000 #@param {type:"slider", min:1000, max:10222, step:2}

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X[:NUM_IMAGES],
                                                  y[:NUM_IMAGES],
                                                  test_size=0.2,
                                                  random_state=42)

len(X_train), len(X_val), len(y_train), len(y_val) 

### Getting the Data Ready in Batches

We will define some functions that preprocess the images and group the data into batches.

In [None]:
# Define image size
IMG_SIZE = 224

# Create a function for preprocessing images
def process_image(image_path, img_size=IMG_SIZE):
  """
  Takes an image file path and turns the image into a Tensor.
  """
  # Read in an image file
  image = tf.io.read_file(image_path)
  # Turn the jpeg image into a numerical Tensor with 3 colour channels (Red, Green, Blue)
  image = tf.image.decode_jpeg(image, channels=3)
  # Convert the colour channel values from 0-255 to 0-1 values
  image = tf.image.convert_image_dtype(image, tf.float32)
  # Resize the image to the requested size (224 x 224 by default)
  image = tf.image.resize(image, size=[img_size, img_size])

  return image
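
The value scaling done by `tf.image.convert_image_dtype` can be mimicked on a tiny array to see what the model actually receives. This is only a NumPy sketch of the uint8-to-float32 case (divide by 255), not TensorFlow's implementation; `toy_image` is a made-up 2x2 RGB image.

```python
import numpy as np

# A made-up 2x2 RGB image with 8-bit channel values (0-255)
toy_image = np.array([[[0, 128, 255], [255, 255, 255]],
                      [[64, 64, 64], [0, 0, 0]]], dtype=np.uint8)

# Scale 0-255 integers to 0-1 floats
scaled = toy_image.astype(np.float32) / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
```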

In [None]:
# Create a simple function to return a tuple (image, label)
def get_image_label(image_path, label):
  """
  Takes an image file path name and the associated label,
  processes the image and returns a tuple of (image, label).
  """
  image = process_image(image_path)
  return image, label

In [None]:
# Define the batch size
BATCH_SIZE = 32

# Create a function to turn data into batches
def create_data_batches(X, y=None, batch_size=BATCH_SIZE, valid_data=False, test_data=False):
  """
  Creates batches of data out of image (X) and label (y) pairs.
  Shuffles the data if it's training data but doesn't shuffle if it's validation data.
  Also accepts test data as input (no labels).
  """
  # If the data is a test dataset
  if test_data:
    print("Creating test data batches...")
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X))) # only filepaths (no labels)
    data_batch = data.map(process_image).batch(batch_size)
    return data_batch
  
  # If the data is a validation dataset
  elif valid_data:
    print("Creating validation data batches...")
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X), # filepaths
                                               tf.constant(y))) # labels
    data_batch = data.map(get_image_label).batch(batch_size)
    return data_batch

  else:
    print("Creating training data batches...")
    # Turn filepaths and labels into Tensors
    data = tf.data.Dataset.from_tensor_slices((tf.constant(X),
                                               tf.constant(y)))
    # Shuffling pathnames and labels before mapping image processor function is faster than shuffling images
    data = data.shuffle(buffer_size=len(X))

    # Create (image, label) tuples (this also turns the image path into a preprocessed image)
    data = data.map(get_image_label)

    # Turn the training data into batches
    data_batch = data.batch(batch_size)
  return data_batch

In [None]:
# Create training and validation data batches
train_data = create_data_batches(X_train, y_train)
val_data = create_data_batches(X_val, y_val, valid_data=True)

In [None]:
# Check out the different attributes of our data batches
train_data.element_spec, val_data.element_spec
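
A quick arithmetic check helps when reading the training logs: `Dataset.batch` keeps the final partial batch by default, so the number of steps per epoch is the sample count divided by the batch size, rounded up. The helper `num_batches` below is a hypothetical name used only for this sketch; the sample counts assume `NUM_IMAGES = 1000` with a 20% validation split.

```python
import math

def num_batches(num_samples, batch_size=32):
    """Number of batches when the last, partial batch is kept."""
    return math.ceil(num_samples / batch_size)

train_steps = num_batches(800)  # 800 training images -> 25 full batches
val_steps = num_batches(200)    # 200 validation images -> 6 full batches + 1 partial

print(train_steps, val_steps)   # 25 7
```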

In [None]:
# Create a function for viewing images in a data batch
def show_25_images(images, labels):
  """
  Displays a plot of 25 images and their labels from a data batch.
  """
  # Setup the figure
  plt.figure(figsize=(15, 15))
  # Loop through 25 (for displaying 25 images)
  for i in range(25):
    # Create subplots (5 rows, 5 columns)
    plt.subplot(5, 5, i+1)
    # Display an image 
    plt.imshow(images[i])
    # Add the image label as the title
    plt.title(unique_breeds[labels[i].argmax()])
    # Turn the grid lines off
    plt.axis("off")

In [None]:
# Now let's visualize the data in a training batch
train_images, train_labels = next(train_data.as_numpy_iterator())
show_25_images(train_images, train_labels)

### Creating the Model
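We will be using the `imagenet/mobilenet_v2_130_224/classification` model from TensorFlow Hub, a MobileNetV2 pretrained on ImageNet that expects 224 x 224 input images.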

In [None]:
INPUT_SHAPE = [None, IMG_SIZE, IMG_SIZE, 3]
OUTPUT_SHAPE = len(unique_breeds)
MODEL_URL = "https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/4"

In [None]:
# Setup the model layers
model = tf.keras.Sequential([
    hub.KerasLayer(MODEL_URL), # Layer 1 (input layer)
    tf.keras.layers.Dense(units=OUTPUT_SHAPE,activation="softmax") # Layer 2 (output layer)
  ])

# Compile the model
model.compile(
      loss=tf.keras.losses.CategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.Adam(),
      metrics=["accuracy"]
  )

# Build the model
model.build(INPUT_SHAPE)
model.summary()

In [None]:
# Load TensorBoard notebook extension
%load_ext tensorboard

In [None]:
!mkdir -p ./logs
# Create a function to build a TensorBoard callback
def create_tensorboard_callback():
  # Create a log directory for storing TensorBoard logs
  logdir = os.path.join("./logs",
                        # Make it so the logs get tracked whenever we run an experiment
                        datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  return tf.keras.callbacks.TensorBoard(logdir)

In [None]:
# Create early stopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                                  patience=3)
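
The effect of `patience=3` can be illustrated with a simplified, pure-Python model of the idea: stop once the monitored metric has failed to improve for three epochs in a row. This is a sketch of the concept only, not Keras's implementation (it ignores options such as `min_delta`), and the accuracy values are made up.

```python
def early_stopping_epoch(metric_history, patience=3):
    """
    Returns the 0-based epoch index at which training would stop,
    or None if the patience is never exhausted.
    """
    best = float("-inf")
    wait = 0
    for epoch, value in enumerate(metric_history):
        if value > best:
            # Improvement: remember it and reset the counter
            best = value
            wait = 0
        else:
            # No improvement: count the epoch and stop once patience runs out
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies, one per epoch
val_accuracy = [0.50, 0.60, 0.61, 0.60, 0.61, 0.59, 0.70]
stop = early_stopping_epoch(val_accuracy, patience=3)
print(stop)  # 5
```

In this toy run the best value (0.61) is reached at epoch 2, the next three epochs do not beat it, and training stops at epoch 5 even though a later epoch would have improved.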

In [None]:
NUM_EPOCHS = 100
tensorboard = create_tensorboard_callback()

# Fit the model to the data passing it the callbacks we created
model.fit(x=train_data,
          epochs=NUM_EPOCHS,
          validation_data=val_data,
          validation_freq=1,
          callbacks=[tensorboard, early_stopping])

### Viewing the performance of our Model
Our model is currently overfitting: it performs much better on the training data than on the validation data. Training on the full dataset (using more data) might reduce the problem.

In [None]:
# Uncomment below to view TensorBoard
#%tensorboard --logdir /content/logs

In [None]:
predictions = model.predict(val_data, verbose=1)
predictions

In [None]:
# Turn prediction probabilities into their respective label (easier to understand)
def get_pred_label(prediction_probabilities):
  """
  Turns an array of prediction probabilities into a label.
  """
  return unique_breeds[np.argmax(prediction_probabilities)]

# Get a predicted label based on an array of prediction probabilities
pred_label = get_pred_label(predictions[81])
pred_label
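
The same `argmax` trick gives an overall accuracy figure when applied to every row at once. The sketch below uses made-up probabilities and one-hot labels (`toy_predictions`, `toy_truth`) rather than the notebook's real arrays.

```python
import numpy as np

# Made-up softmax outputs for 4 images and 3 classes
toy_predictions = np.array([[0.8, 0.1, 0.1],
                            [0.2, 0.5, 0.3],
                            [0.3, 0.3, 0.4],
                            [0.6, 0.3, 0.1]])
# One-hot ground truth for the same 4 images
toy_truth = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [0, 1, 0],
                      [1, 0, 0]])

# Index of the most probable class vs. index of the true class, per image
pred_idx = toy_predictions.argmax(axis=1)
true_idx = toy_truth.argmax(axis=1)

accuracy = float(np.mean(pred_idx == true_idx))
print(accuracy)  # 0.75
```

Because the validation batches are not shuffled, the same comparison should work in this notebook with `predictions` and `y_val` in place of the toy arrays.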

In [None]:
# Create a function to unbatch a batched dataset
def unbatchify(data):
  """
  Takes a batched dataset of (image, label) Tensors and returns separate arrays
  of images and labels.
  """
  images = []
  labels = []
  # Loop through unbatched data
  for image, label in data.unbatch().as_numpy_iterator():
    images.append(image)
    labels.append(unique_breeds[np.argmax(label)])
  return images, labels

# Unbatchify the validation data
val_images, val_labels = unbatchify(val_data)
val_images[0], val_labels[0]

In [None]:
def plot_pred(prediction_probabilities, labels, images, n=1):
  """
  View the prediction, ground truth and image for sample n
  """
  pred_prob, true_label, image = prediction_probabilities[n], labels[n], images[n]

  # Get the pred label
  pred_label = get_pred_label(pred_prob)

  # Plot image & remove ticks
  plt.imshow(image)
  plt.xticks([])
  plt.yticks([])

  # Change the colour of the title depending on if the prediction is right or wrong
  if pred_label == true_label:
    color = "green"
  else:
    color = "red"
  
  # Change plot title to be predicted, probability of prediction and truth label
  plt.title("{} {:2.0f}% {}".format(pred_label,
                                    np.max(pred_prob)*100,
                                    true_label),
                                    color=color)

In [None]:
plot_pred(prediction_probabilities=predictions,
          labels=val_labels,
          images=val_images,
          n=77)

### Saving our Model

In [None]:
# Create a function to save a model
!mkdir -p ./models
def save_model(model, suffix=None):
  """
  Saves a given model in a models directory and appends a suffix (string).
  """
  # Create a model directory pathname with current time
  modeldir = os.path.join("./models",
                          datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  # Only append the suffix if one was given (concatenating None would raise a TypeError)
  model_path = modeldir + ("-" + suffix if suffix else "") + ".h5" # save format of model
  print(f"Saving model to: {model_path}...")
  model.save(model_path)
  return model_path

In [None]:
# Create a function to load a trained model
def load_model(model_path):
  """
  Loads a saved model from a specified path.
  """
  print(f"Loading saved model from: {model_path}")
  model = tf.keras.models.load_model(model_path, 
                                     custom_objects={"KerasLayer":hub.KerasLayer})
  return model

In [None]:
# Save our model trained on 1000 images
#save_model(model, suffix="1000-images-mobilenetv2-Adam")

In [None]:
# Load a trained model
#loaded_1000_image_model = load_model('PATH') #Provide the PATH

### Training on the Full Dataset

In [None]:
len(X), len(y)

In [None]:
# Create a data batch with the full data set
full_data = create_data_batches(X, y)

In [None]:
# Setup the model layers
Final_model = tf.keras.Sequential([
    hub.KerasLayer(MODEL_URL), # Layer 1 (input layer)
    tf.keras.layers.Dense(units=OUTPUT_SHAPE,activation="softmax") # Layer 2 (output layer)
  ])

# Compile the model
Final_model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"]
  )

# Build the model
Final_model.build(INPUT_SHAPE)
Final_model.summary()

In [None]:
# Create final model callbacks
Final_model_tensorboard = create_tensorboard_callback()
Final_model_early_stopping = tf.keras.callbacks.EarlyStopping(monitor="accuracy",
                                                             patience=3)

In [None]:
# Fit the final model to the full data
NUM_EPOCHS=12
Final_model.fit(x=full_data,
               epochs=NUM_EPOCHS,
               callbacks=[Final_model_tensorboard, Final_model_early_stopping])

In [None]:
save_model(Final_model, suffix="full-dataset-mobilenetv2-Adam")

### Creating the submission

In [None]:
# Load test image filenames
test_path = "/content/test/"
test_filenames = [test_path + fname for fname in os.listdir(test_path)]
print(len(test_filenames))
test_filenames[:10]


In [None]:
# Create test data batch
test_data = create_data_batches(test_filenames, test_data=True)

In [None]:
test_data

In [None]:
# Make predictions on the test data batch using the final model trained above
test_predictions = Final_model.predict(test_data, verbose=1)

In [None]:
# Create a pandas DataFrame with empty columns
preds_df = pd.DataFrame(columns=["id"] + list(unique_breeds))
preds_df.head()

In [None]:
# Append test image IDs to predictions DataFrame
# (derived from test_filenames so the order matches the predictions)
test_ids = [os.path.splitext(os.path.basename(path))[0] for path in test_filenames]
preds_df["id"] = test_ids
preds_df.head()

In [None]:
# Add the prediction probabilities to each dog breed column
preds_df[list(unique_breeds)] = test_predictions
preds_df.head()

In [None]:
# Save our predictions dataframe to CSV for submission to Kaggle
preds_df.to_csv("./full_model_predictions_submission.csv",index=False)
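
Before uploading, it can be worth sanity-checking the DataFrame. The sketch below is one possible check, run here on a made-up three-class frame; `check_submission` is a hypothetical helper, not part of pandas or the Kaggle API. The row-sum test reflects what a softmax output should look like and is not a documented competition requirement.

```python
import numpy as np
import pandas as pd

def check_submission(df, class_names, atol=1e-3):
    """Returns a list of problems found in a submission-style DataFrame."""
    problems = []
    expected_columns = ["id"] + list(class_names)
    if list(df.columns) != expected_columns:
        problems.append("columns do not match ['id'] + class names")
        return problems
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    probs = df[list(class_names)].to_numpy(dtype=float)
    if np.isnan(probs).any():
        problems.append("missing probabilities")
    elif not np.allclose(probs.sum(axis=1), 1.0, atol=atol):
        problems.append("some rows do not sum to 1")
    return problems

# Made-up frame with the same layout as preds_df: an id column plus one column per class
toy_classes = ["beagle", "collie", "pug"]
toy_df = pd.DataFrame({"id": ["a1", "b2"],
                       "beagle": [0.7, 0.1],
                       "collie": [0.2, 0.1],
                       "pug": [0.1, 0.8]})
print(check_submission(toy_df, toy_classes))  # []
```

With the notebook's own objects, the equivalent call would be `check_submission(preds_df, unique_breeds)`.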