# Dog Breed Identification Using Deep Learning

---

## Problem Statement

We often get stuck trying to recall the name of a dog’s breed. There are many dog breeds, and many of them look similar to one another. Can we use a dog breeds dataset to build a Deep Learning model that classifies different dog breeds from an image? We use Convolutional Neural Networks to build the model.

## Dataset

The dataset for this project is available on Kaggle. <br>

**Link** : https://www.kaggle.com/c/dog-breed-identification/data

## Evaluation

We shall use <code>Accuracy</code>, <code>Precision</code>, <code>Recall</code> and <code>F1 score</code> to evaluate the performance of our models, along with the heatmap of the confusion matrix.<br>
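
As a rough illustration of how these metrics relate to the confusion matrix, the sketch below computes accuracy and macro-averaged precision, recall and F1 score from a small 3-class matrix. The counts and the helper name `macro_metrics` are made up for illustration and are not results from this project. Rows are true classes and columns are predicted classes, which is the convention used by scikit-learn's `confusion_matrix`.

```python
import numpy as np

def macro_metrics(cm):
    """Return (accuracy, macro precision, macro recall, macro F1).

    cm[i, j] counts samples whose true class is i and predicted class is j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                # correctly classified samples per class
    predicted = cm.sum(axis=0)      # how often each class was predicted
    actual = cm.sum(axis=1)         # how often each class really occurs

    # Classes that are never predicted (or never occur) get a score of 0
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)

    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()

# Hypothetical confusion matrix for three classes
toy_cm = [[5, 1, 0],
          [2, 3, 1],
          [0, 0, 4]]
acc, prec, rec, f1 = macro_metrics(toy_cm)
print(acc, prec, rec, f1)
```

Macro averaging gives every breed the same weight regardless of how many images it has, which is worth keeping in mind when the classes are not equally represented.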

---

## Table of Contents

### 1. Environment Setup
### 2. Dataset Gathering
### 3. Exploratory Data Analysis
### 4. Data Preprocessing
### 5. Model Evaluation
### 6. Performance Measurement

# 1. Environment Setup:
---

> In this step, we have installed and imported all the libraries necessary to proceed with the solution to the given problem statement.

In [None]:
import os
import cv2
import tqdm
import random
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
from IPython.display import display, Image
from sklearn.model_selection import train_test_split
from sklearn.metrics import (confusion_matrix, classification_report, accuracy_score,
                             precision_score, recall_score, f1_score,
                             precision_recall_fscore_support)

warnings.filterwarnings("ignore")                   # Suppressing Jupyter Notebook Warnings

# 2. Dataset Gathering
---
> In this step, we have gathered the dataset from Kaggle and verified its integrity.

In [None]:
# Importing the labels dataset
labels_csv = pd.read_csv('../input/dog-breed-identification/labels.csv')

# Viewing the head of the dataset
labels_csv.head()

In [None]:
# Saving the training dataset path to a variable
train_path = "../input/dog-breed-identification/train/"

# Creating image paths from the name
filenames = [train_path + fname + ".jpg" for fname in labels_csv['id']]

# Viewing the first 10 filenames
filenames[:10]

In [None]:
# Checking whether the number of files in the training directory matches the number of filenames built from the labels
print(len(os.listdir(train_path)) == len(filenames))
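
> Comparing counts only confirms that the two numbers agree. A slightly stricter check, sketched below, compares the actual file names. The helper `check_image_files` and the demo file names are hypothetical and only illustrate the idea on a throwaway directory.

```python
import os
import tempfile

def check_image_files(image_dir, image_ids, extension=".jpg"):
    """Compare the files on disk with the ids listed in the labels file.

    Returns (missing, unexpected): expected files that are not on disk,
    and files on disk that no id refers to.
    """
    expected = {image_id + extension for image_id in image_ids}
    on_disk = set(os.listdir(image_dir))
    missing = sorted(expected - on_disk)
    unexpected = sorted(on_disk - expected)
    return missing, unexpected

# Demo on a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["a.jpg", "b.jpg", "x.jpg"]:
        open(os.path.join(demo_dir, name), "w").close()
    missing, unexpected = check_image_files(demo_dir, ["a", "b", "c"])

print("Missing:", missing)
print("Unexpected:", unexpected)
```

> In the notebook, the same helper could be called as `check_image_files(train_path, labels_csv['id'])`, and both returned lists should then be empty.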

# 3. Exploratory Data Analysis
---
> In this step, we took a deeper look at the data and checked whether it was gathered properly in the previous steps.

In [None]:
# Viewing an image using filename
Image("../input/dog-breed-identification/train/0042188c895a2f14ef64a918ed9c7b64.jpg")

In [None]:
# Visualizing the distribution of images according to class
labels_csv["breed"].value_counts().plot.bar(figsize=(20, 10));
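
> Alongside the bar chart, a couple of numbers can summarize how balanced the classes are. The sketch below uses a made-up list of labels and a hypothetical helper `class_balance`; the counts are not taken from the dataset.

```python
from collections import Counter

def class_balance(labels):
    """Return (most common, least common, imbalance ratio) for a list of labels."""
    counts = Counter(labels).most_common()   # sorted from most to least frequent
    (top_label, top_count), (low_label, low_count) = counts[0], counts[-1]
    return (top_label, top_count), (low_label, low_count), top_count / low_count

# Hypothetical labels standing in for labels_csv["breed"]
toy_labels = ["pug"] * 6 + ["beagle"] * 3 + ["collie"] * 2
most, least, ratio = class_balance(toy_labels)
print(most, least, ratio)
```

> A ratio close to 1 suggests the classes are roughly balanced, while a large ratio suggests that per-class metrics deserve a closer look.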

# 4. Data Preprocessing:
---
> In this step, we have prepared the data obtained from the previous steps before splitting it into training and validation datasets. We have also prepared the images by resizing them and rescaling their color channel values.

In [None]:
# Converting the label columns to Numpy array
labels = labels_csv['breed'].to_numpy()

# Viewing the first 10 labels
labels[:10]

In [None]:
# Saving the unique breeds to a variable
unique_breeds = np.unique(labels)

print("Total number of unique breeds : ", len(unique_breeds))

In [None]:
# Converting the labels to a boolean array
boolean_labels = [label == np.array(unique_breeds) for label in labels]

# Viewing what the first boolean label looks like
boolean_labels[0]
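
> The boolean encoding above is a one-hot encoding: each row has exactly one `True`, at the position of the matching breed. The sketch below shows the round trip on a few made-up labels (the `toy_` names are hypothetical), using NumPy broadcasting instead of a list comprehension.

```python
import numpy as np

# Hypothetical labels standing in for labels_csv['breed']
toy_labels = np.array(["pug", "beagle", "pug", "collie"])
toy_breeds = np.unique(toy_labels)      # sorted unique breeds

# One boolean row per label, True only in the column of the matching breed
toy_boolean = toy_labels[:, None] == toy_breeds[None, :]

# argmax finds the position of the True value, which maps back to the breed name
decoded = toy_breeds[toy_boolean.argmax(axis=1)]

print(toy_boolean.astype(int))
print(decoded)
```

> The same `argmax` lookup is what turns a row of predicted probabilities back into a breed name later on.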

In [None]:
# Creating training and validation sets

# Separating the features and labels
X = filenames
y = boolean_labels

print(f"Number of training images: {len(X)}")
print(f"Number of labels: {len(y)}")

X_train, X_val, y_train, y_val = train_test_split(X,
                                                  y, 
                                                  test_size=0.2,
                                                  random_state=42)

print(f"Number of training images : {len(X_train)}")
print(f"Number of validation images : {len(X_val)}")

#### Image Preprocessing:

> In this step, we have resized the images and rescaled their color channel values.

In [None]:
# Reading an image in and checking shape
image = plt.imread(filenames[42])
print(f"Image Shape : {image.shape}")

# Converting the image to a Tensorflow Tensor
tf.constant(image)

In [None]:
# Setting the Image Size
IMAGE_SIZE = 224

# Creating a function to preprocess the images
def process_image(image_path):
    
    # Read in the image
    image = tf.io.read_file(image_path)
    
    # Turn the image into numerical tensors
    image = tf.image.decode_jpeg(image, channels=3)
    
    # Convert the color channel values from 0-255 to 0-1
    image = tf.image.convert_image_dtype(image, tf.float32)
    
    # Resize the image
    image = tf.image.resize(image, size=[IMAGE_SIZE, IMAGE_SIZE])
    
    return image
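
> When converting an integer image to a floating-point type, `tf.image.convert_image_dtype` also rescales the values, so 8-bit values in the range 0-255 end up in the range 0-1. A roughly equivalent NumPy sketch on a tiny made-up image (the array `pixels` is hypothetical) looks like this:

```python
import numpy as np

# Hypothetical 2x2 RGB image with 8-bit channel values
pixels = np.array([[[0, 128, 255], [64, 64, 64]],
                   [[255, 255, 255], [0, 0, 0]]], dtype=np.uint8)

# Cast to float and divide by the largest possible 8-bit value
scaled = pixels.astype(np.float32) / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
```

> Keeping inputs in a small, fixed range is a common preprocessing choice for neural networks; resizing to `IMAGE_SIZE` afterwards gives every image the same shape.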

#### Batching the Data:
> Here, we have processed the images together with their labels and grouped them into batches for faster and more efficient training.

In [None]:
# Creating a function to return a tuple (image, label)
def get_image_label(image_path, label):
    """
    Takes an image file path name and the associated label,
    processes the image and returns a tuple of (image, label).
    """
    image = process_image(image_path)
    return image, label

In [None]:
# Setting the batch size at 32 
BATCH_SIZE = 32

# Create a function to turn data into batches
def create_data_batches(x, y=None, batch_size=BATCH_SIZE, valid_data=False, test_data=False):
    """
    Turns image file paths (x) and labels (y) into batches.
    Training data is shuffled; validation and test data are not.
    Test data is expected to have no labels.
    """
    # If the data is a test dataset, we probably don't have labels
    if test_data:
        print("Creating test data batches...")
        data = tf.data.Dataset.from_tensor_slices(tf.constant(x)) # only filepaths
        data_batch = data.map(process_image).batch(batch_size)
        return data_batch

    # If the data is a validation dataset, we don't need to shuffle it
    elif valid_data:
        print("Creating validation data batches...")
        data = tf.data.Dataset.from_tensor_slices((tf.constant(x), # filepaths
                                                   tf.constant(y))) # labels
        data_batch = data.map(get_image_label).batch(batch_size)
        return data_batch

    else:
        # If the data is a training dataset, we shuffle it
        print("Creating training data batches...")
        # Turn filepaths and labels into Tensors
        data = tf.data.Dataset.from_tensor_slices((tf.constant(x), # filepaths
                                                   tf.constant(y))) # labels

        # Shuffling pathnames and labels before mapping image processor function is faster than shuffling images
        data = data.shuffle(buffer_size=len(x))

        # Create (image, label) tuples (this also turns the image path into a preprocessed image)
        data = data.map(get_image_label)

        # Turn the data into batches
        data_batch = data.batch(batch_size)
        return data_batch
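
> To make the shuffle-then-batch idea concrete without TensorFlow, here is a plain Python sketch. The helper `toy_batches` and the file names are hypothetical. Each path stays paired with its label while shuffling, and the last batch is smaller when the number of items is not a multiple of the batch size.

```python
import math
import random

def toy_batches(items, batch_size, shuffle=False, seed=0):
    """Split a list into consecutive batches, optionally shuffling it first."""
    items = list(items)
    if shuffle:
        random.Random(seed).shuffle(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Hypothetical (path, label) pairs standing in for the real filenames and labels
pairs = [(f"img_{i}.jpg", i % 3) for i in range(10)]
batches = toy_batches(pairs, batch_size=4, shuffle=True)

print(len(batches), [len(batch) for batch in batches])
print(math.ceil(len(pairs) / 4))  # number of batches expected for 10 items
```

> With a batch size of 32, the number of batches per epoch follows the same rule: the number of images divided by 32, rounded up.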

In [None]:
# Creating training and validation data batches
train_data = create_data_batches(X_train, y_train)
val_data = create_data_batches(X_val, y_val, valid_data=True)

In [None]:
# Checking the different attributes of our data batches
train_data.element_spec, val_data.element_spec

# 5. Model Evaluation:
---
> In this step, we have chosen ResNet50V2 as the pretrained model, since architectures of this kind tend to perform well on image classification problems such as this one. The model is trained with the Adam optimizer and the categorical cross-entropy loss.

In [None]:
# Setup input shape to the model
INPUT_SHAPE = [None, IMAGE_SIZE, IMAGE_SIZE, 3] # batch, height, width, colour channels

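# TensorFlow Hub URL of the pretrained model (the handle is named resnet_50, while this notebook refers to the model as ResNet50V2)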
MODEL_URL = "https://tfhub.dev/tensorflow/resnet_50/classification/1"

# Creating the model for ResNet50V2
model = tf.keras.Sequential([
    # Layer 1 : Pretrained model loaded from TensorFlow Hub
    hub.KerasLayer(MODEL_URL),
    
    # Layer 2 : Output layer with one unit per breed
    tf.keras.layers.Dense(len(unique_breeds), activation='softmax')
])

# Compiling the model
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])

# Building the model
model.build(INPUT_SHAPE)

# Summary of the model
model.summary()
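
> The output layer uses softmax and the loss is categorical cross-entropy. The NumPy sketch below shows what these two pieces compute for a single made-up example with three classes. The function names, the scores and the `eps` clipping value are illustrative assumptions, not Keras internals.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(one_hot, probs, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)    # avoid log(0)
    return float(-(one_hot * np.log(probs)).sum(axis=-1).mean())

# Hypothetical raw scores for one image and its one-hot label
logits = np.array([[2.0, 1.0, 0.1]])
target = np.array([[1.0, 0.0, 0.0]])

probs = softmax(logits)
loss = categorical_cross_entropy(target, probs)
print(probs, loss)
```

> The loss shrinks towards 0 as the probability assigned to the true breed approaches 1, which is what training tries to achieve.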

In [None]:
# Creating Tensorflow EarlyStopping
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=5)
# Fitting the model
hist = model.fit(train_data, epochs=50, validation_data=val_data, callbacks=[early_stopping])
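
> Early stopping with `patience=5` ends training once the monitored value has not improved for five consecutive epochs. The plain Python sketch below is a simplified illustration of that rule, not the Keras implementation; the helper name and the accuracy values are made up.

```python
def early_stopping_epoch(val_scores, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` consecutive epochs pass without
    beating the best validation score seen so far.
    """
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best = score
            wait = 0        # improvement: reset the counter
        else:
            wait += 1       # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, one per epoch
history = [0.50, 0.62, 0.70, 0.69, 0.70, 0.68, 0.69, 0.70, 0.66]
stop_at = early_stopping_epoch(history, patience=5)
print(stop_at)
```

> Here the best score is reached at epoch 3 and never beaten, so training would stop five epochs later. This is why `epochs=50` acts as an upper limit rather than a fixed training length.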

#### Model Evaluation:

> In this step, we have plotted the performance of the model in terms of accuracy vs epochs and loss vs epochs.

In [None]:
# Creating graphs to visualize the accuracy and loss of the model
plt.style.use('fivethirtyeight')    # Setting the style before creating the figure so that it applies to it

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 8), squeeze=False)

fig.tight_layout(pad=5)

# Graph for ResNet50V2 Training Accuracy vs Validation Accuracy
axes[0][0].plot(hist.history['accuracy'])
axes[0][0].plot(hist.history['val_accuracy'])
axes[0][0].set_ylabel("Accuracy")
axes[0][0].set_xlabel("Epochs")
axes[0][0].set_title('ResNet50V2 Train Acc vs Val Acc')
axes[0][0].legend(['Train', 'Validation'], loc='upper left')

# Graph for ResNet50V2 Training Loss vs Validation Loss
axes[0][1].plot(hist.history['loss'])
axes[0][1].plot(hist.history['val_loss'])
axes[0][1].set_ylabel("Loss")
axes[0][1].set_xlabel("Epochs")
axes[0][1].set_title('ResNet50V2 Train Loss vs Val Loss')
axes[0][1].legend(['Train', 'Validation'], loc='upper left')

In [None]:
# Making predictions
predictions = model.predict(val_data, verbose=2)

# Viewing the predictions
predictions[0]

# 6. Performance Measurement
---
> In this step, we have measured the performance of the model and plotted a heatmap of the confusion matrix.

In [None]:
# Checking the shape of the prediction
print("Viewing the Shape : ", predictions.shape)

# Checking the maximum probability
print(f"Maximum value (probability of prediction) : {np.max(predictions[0])}")

# Maximum index
print(f"Maximum index : {np.argmax(predictions[0])}")

# Predicted label
print(f"Predicted Label : {unique_breeds[np.argmax(predictions[0])]}")
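
> Since many breeds look alike, it can also help to look at the few most likely breeds instead of only the single best guess. The sketch below does this for one made-up probability vector; the helper `top_k_breeds` and the toy values are hypothetical.

```python
import numpy as np

def top_k_breeds(probabilities, breed_names, k=3):
    """Return the k (breed, probability) pairs with the highest probability."""
    order = np.argsort(probabilities)[::-1][:k]     # indices sorted from highest to lowest
    return [(str(breed_names[i]), float(probabilities[i])) for i in order]

# Hypothetical breeds and predicted probabilities for one image
toy_breeds = np.array(["beagle", "collie", "pug", "whippet"])
toy_probs = np.array([0.10, 0.55, 0.05, 0.30])

top3 = top_k_breeds(toy_probs, toy_breeds, k=3)
print(top3)
```

> In the notebook, the same helper could be applied as `top_k_breeds(predictions[0], unique_breeds)`.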

In [None]:
# Creating a function to unbatch the data
def unbatching(data):
    '''
    This function is used to unbatch the data
    '''
    # Creating variables to save the images and labels
    images = []
    labels = []
    
    # Looping through the unbatched data
    for image, label in data.unbatch().as_numpy_iterator():
        images.append(image)
        labels.append(unique_breeds[np.argmax(label)])
    return images, labels

# Unbatching the validation data
val_images, val_labels = unbatching(val_data)
val_images[0], val_labels[0]

In [None]:
# Getting the predicted labels
predicted_labels = [unique_breeds[np.argmax(predictions[i])] for i in range(len(predictions))]

In [None]:
confusion_matrix(val_labels, predicted_labels).shape

In [None]:
# Plotting the confusion matrix
plt.figure(figsize=(50, 50))
# Passing the labels explicitly keeps the rows and columns aligned with the tick labels
mat = confusion_matrix(val_labels, predicted_labels, labels=unique_breeds)
sns.heatmap(mat.T, square=True, annot=True, cmap="rocket", xticklabels=unique_breeds, yticklabels=unique_breeds)
plt.xlabel("True Labels")
plt.ylabel("Predicted Labels")
plt.show()
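
> A heatmap with this many classes can be hard to read, so listing the largest off-diagonal entries is one way to see which breeds get mixed up most often. The sketch below does this for a small made-up matrix; the helper `most_confused` and the counts are hypothetical. Rows are true classes and columns are predicted classes, as returned by `confusion_matrix`.

```python
import numpy as np

def most_confused(cm, class_names, top=3):
    """List the largest off-diagonal entries as (true, predicted, count)."""
    off_diag = np.array(cm)
    np.fill_diagonal(off_diag, 0)       # ignore correct predictions
    flat_order = np.argsort(off_diag, axis=None)[::-1][:top]
    rows, cols = np.unravel_index(flat_order, off_diag.shape)
    return [(class_names[r], class_names[c], int(off_diag[r, c]))
            for r, c in zip(rows, cols) if off_diag[r, c] > 0]

# Hypothetical confusion matrix for three breeds
toy_names = ["beagle", "collie", "pug"]
toy_cm = [[5, 1, 0],
          [3, 3, 2],
          [0, 0, 4]]

confused = most_confused(toy_cm, toy_names, top=2)
print(confused)
```

> In the notebook, this could be called on the untransposed matrix, for example `most_confused(mat, unique_breeds, top=10)`.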

In [None]:
print(classification_report(val_labels, predicted_labels))

In [None]:
# Getting the performance metrics of our champion model
print("#******** ResNet50V2 Performance Metrics ********#")
print(" ")
print(f"Accuracy Score  = {accuracy_score(val_labels, predicted_labels) * 100}")
print(f"Precision Score = {precision_score(val_labels, predicted_labels, average='macro') * 100}")
print(f"Recall Score    = {recall_score(val_labels, predicted_labels, average='macro') * 100}")
print(f"F1 Score        = {f1_score(val_labels, predicted_labels, average='macro') * 100}")