In [None]:
#!unzip "drive/MyDrive/Dog-Vision/dog-breed-identificatioan.zip" -d "drive/MyDrive/Dog-Vision/"

## End-to-end dog breed classification project

Dogs are incredible. But have you ever been sitting at a cafe, seen a dog and not known what breed it is? I have. And then someone says, "it's an English Terrier" and you think, how did they know that?

In this project we're going to be using machine learning to help us identify different breeds of dogs.

To do this, we'll be using data from the Kaggle dog breed identification competition. It consists of a collection of 10,000+ labelled images of 120 different dog breeds.

This kind of problem is called multi-class image classification. It's multi-class because we're trying to classify multiple different breeds of dog. If we were only trying to classify dogs versus cats, it would be called binary classification (one thing versus another).

Multi-class image classification is an important problem because it's the same kind of technology Tesla uses in their self-driving cars or Airbnb uses in automatically adding information to their listings.

Since the most important step in a deep learning problem is getting the data ready (turning it into numbers), that's what we're going to start with.

We're going to go through the following TensorFlow/Deep Learning workflow:

1. Get data ready (download from Kaggle, store, import).
2. Prepare the data (preprocessing, the 3 sets, X & y).
3. Choose and fit/train a model (TensorFlow Hub, tf.keras.applications, TensorBoard, EarlyStopping).
4. Evaluate the model (making predictions, comparing them with the ground truth labels).
5. Improve the model through experimentation (start with 1000 images, make sure it works, increase the number of images).
6. Save, share and reload your model (once you're happy with the results).

For preprocessing our data, we're going to use TensorFlow 2.x. The whole premise here is to get our data into Tensors (arrays of numbers which can be run on GPUs) and then allow a machine learning model to find patterns between them.

For our machine learning model, we're going to be using a pretrained deep learning model from TensorFlow Hub.

The process of using a pretrained model and adapting it to your own problem is called transfer learning. We do this because rather than training our own model from scratch (which could be time-consuming and expensive), we leverage the patterns learned by another model which has already been trained to classify images.

## Getting our workspace ready
Before we get started, since we'll be using TensorFlow 2.x and TensorFlow Hub, let's import them.

NOTE: If you're already on TF 2.x (as on Google Colab), there's nothing to install; the cells below only import the libraries and print their versions.

In [None]:
# Import TensorFlow into Google Colab
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

In [None]:
import tensorflow_hub as hub
print ("Tensorflow Hub version : ",hub.__version__)

In [None]:
print("GPU", "available (YESS!!!!)" if tf.config.list_physical_devices("GPU") else "not available :(")

# Read Labels

In [None]:
import pandas as pd
labels_csv=pd.read_csv("drive/MyDrive/Dog-Vision/labels.csv")

In [None]:
labels_csv.describe()

In [None]:
labels_csv.head()

In [None]:
labels_csv["breed"].value_counts()

In [None]:
labels_csv["breed"].value_counts().plot.bar(figsize=(20,10))

In [None]:
from IPython.display import Image

In [None]:
Image("drive/MyDrive/Dog-Vision/train/001513dfcb2ffafc82cccf4d8bbaba97.jpg")

In [None]:
labels_csv.head()

In [None]:
filename=["drive/MyDrive/Dog-Vision/train/"+fname+".jpg" for fname in labels_csv["id"]]

In [None]:
filename

In [None]:
import os
if len(os.listdir("drive/MyDrive/Dog-Vision/train/")) == len(filename):
  print("Number of filenames matches the number of files. Proceed.")
else:
  print("Number of filenames doesn't match the number of files in the train directory.")


In [None]:
Image(filename[9000])

In [None]:
labels_csv["breed"][9000]

We now have a list of image file paths. Next, let's get the labels ready so they can be turned into numbers.

In [None]:
import numpy as np
labels = labels_csv["breed"].to_numpy() 
# labels = np.array(labels) # does same thing as above
labels

In [None]:
len(labels)

In [None]:
if len(labels) == len(filename):
  print("Number of labels matches number of filenames. Proceed.")
else:
  print("Number of labels doesn't match number of filenames.")

In [None]:

unique_breeds_list = np.unique(labels)
len(unique_breeds_list)

> Turn a single label into an array of booleans

This creates an array of booleans whose length equals the number of unique breeds. It is `True` in exactly one place: the position of the actual breed in the unique breeds list. For example, BostonBull is `True` at index 19 because that is where it appears in the unique breeds list.
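
The same idea can be sketched on a tiny example. The breed names below are made up for illustration and are not read from `labels.csv`; the point is only that comparing one label against the sorted unique labels yields a boolean array with a single `True`.

```python
import numpy as np

# Toy labels (illustrative names, not taken from labels.csv)
toy_labels = np.array(["beagle", "pug", "beagle", "collie"])
toy_unique = np.unique(toy_labels)  # np.unique returns sorted values

# Comparing one label against every unique breed gives a boolean array
first_boolean = toy_labels[0] == toy_unique
print(toy_unique)              # ['beagle' 'collie' 'pug']
print(first_boolean)           # [ True False False]
print(first_boolean.argmax())  # 0, the position of 'beagle' in toy_unique
```

Because `np.unique` sorts its output, the position of the `True` value doubles as a stable integer ID for the breed.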

In [None]:
print(labels[0])
# True where labels[0] exists in unique_breeds_list
labels[0]==unique_breeds_list

We convert every image's label into a NumPy array in which one index is `True` and the rest are `False`. The position of that single `True` tells us which breed the dog belongs to.

In [None]:
## convert all labels to boolean
boolen_labels=[label == unique_breeds_list for label in labels]

In [None]:
len(boolen_labels)

## Turning boolean into integer

In [None]:
print(labels[0])  # original label
print(np.where(unique_breeds_list == labels[0]))  # index where the label occurs
print(boolen_labels[0].argmax())  # index where the label occurs in the boolean array
print(boolen_labels[0].astype(int))  # there will be a 1 where the sample label occurs

In [None]:
filename[:10]

In [None]:
boolen_labels[:2]

## Create our own validation set
> Since Kaggle doesn't provide a validation set, we will create our own.
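
As a rough sketch of what holding out a validation set means, the snippet below shuffles indices once and sets aside a fraction of them. `simple_split` and the toy data are hypothetical names made up for this illustration; scikit-learn's `train_test_split` does the same job with more options.

```python
import numpy as np

def simple_split(items, targets, val_fraction=0.2, seed=42):
    """Shuffle the indices once, then hold out the first val_fraction for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(items))
    n_val = int(len(items) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    # Index items and targets with the same indices so pairs stay aligned
    train = ([items[i] for i in train_idx], [targets[i] for i in train_idx])
    val = ([items[i] for i in val_idx], [targets[i] for i in val_idx])
    return train, val

toy_paths = [f"img_{i}.jpg" for i in range(10)]
toy_targets = list(range(10))
(train_x, train_y), (val_x, val_y) = simple_split(toy_paths, toy_targets)
print(len(train_x), len(val_x))  # 8 2
```

The important property is that each file path keeps its own label after shuffling and that no sample ends up in both sets.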

In [None]:
# create x and y
x=filename
y=boolen_labels

> We are going to start with 1000 images and gradually increase the number of images.

In [None]:
NUM_IMAGES=1000 #@param {type:"slider",min:1000,max:10000,step:100}

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_val,y_train,y_val=train_test_split(
    x[:NUM_IMAGES],
    y[:NUM_IMAGES],
    test_size=0.2,
    random_state=42,
)


In [None]:
len(x_train),len(x_val),len(y_train),len(y_val)

In [None]:
x_train[:2],y_val[:2]

## Convert into tensors

##### X values are file paths and Y values are boolean representations of the labels for those file paths

Preprocessing (turning images into tensors)
1. Take an image filepath as input
2. Use TensorFlow to read the file and save it to a variable, `image`
3. Turn our `image` (a JPEG) into tensors
4. Resize the image to (224, 224)
5. Return the modified image

##### See how an image looks 

In [None]:
from matplotlib.pyplot import imread
image=imread(filename[42])
image.shape

In [None]:
image.max(),image.min()

In [None]:
image

In [None]:
## Convert image to tensor
tf.constant(image)

In [None]:
Image(filename[25])

In [None]:
image=imread(filename[25])
image

In [None]:
tf.constant(image)

## Create a function to preprocess images
1. Take an image filepath as input
2. Use TensorFlow to read the file and save it to a variable, `image`
3. Turn our `image` (a JPEG) into tensors
 3.1. Normalize the colour channels from 0-255 to 0-1
4. Resize the image to (224, 224)
5. Return the modified image
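
To make the normalisation and resizing steps concrete, here is a NumPy-only sketch on a random fake image. `toy_preprocess` is a hypothetical helper, and it uses nearest-neighbour sampling for simplicity, whereas `tf.image.resize` uses bilinear interpolation by default, so the pixel values will not match TensorFlow's output exactly.

```python
import numpy as np

def toy_preprocess(image_uint8, size=224):
    """Scale pixel values to [0, 1] and resize with nearest-neighbour sampling."""
    scaled = image_uint8.astype(np.float32) / 255.0
    height, width = scaled.shape[:2]
    # For each output row/column, pick the closest source row/column
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    return scaled[rows][:, cols]

# A fake 300x400 RGB image with integer values in 0-255
fake_image = np.random.default_rng(0).integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
processed = toy_preprocess(fake_image)
print(processed.shape, processed.dtype)  # (224, 224, 3) float32
```

Whatever the input size, the output always has the same shape and value range, which is what the model expects.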

In [None]:
IMG_SIZE=224

def process_images(image_path):
  """
  Takes an image file path and turns it into a Tensor.
  """
  # Read in image file
  image = tf.io.read_file(image_path)
  # Turn the jpeg image into numerical Tensor with 3 colour channels (Red, Green, Blue)
  image = tf.image.decode_jpeg(image, channels=3)
  # Convert the colour channel values from 0-255 values to 0-1 values
  image = tf.image.convert_image_dtype(image, tf.float32)
  # Resize the image to our desired size (224, 224)
  image = tf.image.resize(image, size=[IMG_SIZE, IMG_SIZE])
  return image

## Turning data into batches

> Why do we need to turn our data into batches?

Our computer can't fit the whole dataset into memory in one go, so we arrange it into batches of 32 images that are easier to process.

> Before we turn the images into batches, we need tuples that look like this: (`image`, `label`)
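
A plain-Python sketch of batching, independent of `tf.data`, is shown below. `toy_batches` and the fake (path, label) pairs are hypothetical; the sketch only shows how 100 items fall into batches of 32.

```python
def toy_batches(items, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 100 fake (image path, label) pairs
pairs = [(f"img_{i}.jpg", i % 3) for i in range(100)]
batch_sizes = [len(batch) for batch in toy_batches(pairs)]
print(batch_sizes)  # [32, 32, 32, 4]
```

Note that the last batch is smaller when the dataset size is not a multiple of the batch size.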

In [None]:
# Create a simple function to return a tuple (image, label)
def get_images_label(image_path, label):
  """
  Takes an image file path name and the associated label,
  processes the image and returns a tuple of (image, label).
  """
  image = process_images(image_path)
  return image, label

In [None]:
(process_images(x[42]),tf.constant(y[42]))

Now that we have a way to turn our data into tuples of tensors in the form (`image`, `label`), let's write a function to turn all of our data (x and y) into batches.

In [None]:
BATCH_SIZE=32

In [None]:
# Define the batch size, 32 is a good default
BATCH_SIZE = 32

# Create a function to turn data into batches
def create_data_batches(x, y=None, batch_size=BATCH_SIZE, valid_data=False, test_data=False):
  """
  Creates batches of data out of image (x) and label (y) pairs.
  Shuffles the data if it's training data but doesn't shuffle it if it's validation data.
  Also accepts test data as input (no labels).
  """
  # If the data is a test dataset, we probably don't have labels
  if test_data:
    print("Creating test data batches...")
    data = tf.data.Dataset.from_tensor_slices((tf.constant(x))) # only filepaths
    data_batch = data.map(process_images).batch(batch_size)
    return data_batch
  
  # If the data is a validation dataset, we don't need to shuffle it
  elif valid_data:
    print("Creating validation data batches...")
    data = tf.data.Dataset.from_tensor_slices((tf.constant(x), # filepaths
                                               tf.constant(y))) # labels
    data_batch = data.map(get_images_label).batch(batch_size)
    return data_batch

  else:
    # If the data is a training dataset, we shuffle it
    print("Creating training data batches...")
    # Turn filepaths and labels into Tensors
    data = tf.data.Dataset.from_tensor_slices((tf.constant(x), # filepaths
                                              tf.constant(y))) # labels
    
    # Shuffling pathnames and labels before mapping image processor function is faster than shuffling images
    data = data.shuffle(buffer_size=len(x))

    # Create (image, label) tuples (this also turns the image path into a preprocessed image)
    data = data.map(get_images_label)

    # Turn the data into batches
    data_batch = data.batch(batch_size)
  return data_batch

In [None]:
train_data=create_data_batches(x_train,y_train)
valid_data=create_data_batches(x_val,y_val,valid_data=True)

In [None]:
train_data.element_spec,valid_data.element_spec

## Visualizing Data 
> Batches of tensors are a bit hard to comprehend, so let's visualize them.

In [None]:
import matplotlib.pyplot as plt 
# Create a function for viewing images in a data batch
def show_25_images(images, labels):
  """
  Displays 25 images from a data batch.
  """
  # Setup the figure
  plt.figure(figsize=(10, 10))
  # Loop through 25 (for displaying 25 images)
  for i in range(25):
    # Create subplots (5 rows, 5 columns)
    ax = plt.subplot(5, 5, i+1)
    # Display an image
    plt.imshow(images[i])
    # Add the image label as the title
    plt.title(unique_breeds_list[labels[i].argmax()])
    # Turn grid lines off
    plt.axis("off")



In [None]:
train_images, train_labels = next(train_data.as_numpy_iterator())
show_25_images(train_images, train_labels)

In [None]:
val_images,val_labels=next(valid_data.as_numpy_iterator())

In [None]:
show_25_images(val_images,val_labels)

## Before we build a model
A few things we need to define:
1. The input shape (our image shape, in the form of tensors) to our model
2. The output shape (image labels, in the form of tensors) of our model
3. The URL of the model we want to use from TensorFlow Hub:
https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/5
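
To picture these shapes, the sketch below builds dummy arrays for one batch. The `toy_` names are placeholders, with 32 images per batch, 224x224 RGB images and 120 breeds assumed from the rest of this notebook. In the model's input shape the batch dimension is written as `None` because it can vary.

```python
import numpy as np

toy_batch = 32         # images per batch
toy_img_size = 224     # height and width after resizing
toy_num_breeds = 120   # one output per breed

# One batch of images: (batch, height, width, colour channels)
images = np.zeros((toy_batch, toy_img_size, toy_img_size, 3), dtype=np.float32)
# One batch of labels: (batch, number of breeds), a single True per row in practice
labels_batch = np.zeros((toy_batch, toy_num_breeds), dtype=bool)

print(images.shape)        # (32, 224, 224, 3)
print(labels_batch.shape)  # (32, 120)
```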

In [None]:
INPUT_SHAPE=[None,IMG_SIZE,IMG_SIZE,3]
OUTPUT_SHAPE=len(unique_breeds_list)
MODEL_URL="https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/5"

Now we've got the inputs, outputs and model we're using ready to go. We can start to put them together.

There are many ways of building a model in TensorFlow but one of the best ways to get started is to use the Keras API.

Defining a deep learning model in Keras can be as straightforward as saying, "here are the layers of the model, the input shape and the output shape, let's go!"

Knowing this, let's create a function which:

* Takes the input shape, output shape and the URL of our chosen model as parameters.
* Defines the layers in a Keras model in a sequential fashion (do this first, then this, then that).
* Compiles the model (says how it should be evaluated and improved).
* Builds the model (tells it what kind of input shape it'll be getting).
* Returns the model.

We'll take a look at the code first, then discuss each part.

In [None]:
# Function to create the model
def create_model(input_shape=INPUT_SHAPE, output_shape=OUTPUT_SHAPE, model_url=MODEL_URL):
  # Set up the model layers (use the function's parameters, not the globals)
  model = tf.keras.Sequential([
      hub.KerasLayer(model_url),  # Layer 1: pretrained model from TensorFlow Hub (input layer)
      tf.keras.layers.Dense(units=output_shape,
                            activation="softmax")  # Layer 2: one output per breed
  ])
  # Compile the model
  model.compile(
      loss=tf.keras.losses.CategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.Adam(),
      metrics=["Accuracy"]
  )
  # Build the model
  model.build(input_shape)
  return model

In [None]:
model=create_model()

In [None]:
model.summary()

## Create some callbacks
> Callbacks are helper functions a model can use during training to do things such as save its progress or stop training early if the model isn't improving.


We will create two callbacks:
* One for TensorBoard, to keep track of our progress.
* One for early stopping, to prevent overfitting by halting training when the model stops improving.

## TensorBoard callback

In [None]:
# Load tensorboard
%load_ext tensorboard

> To set up a TensorBoard callback, we need to do 3 things:
1. Load the TensorBoard notebook extension.
2. Create a TensorBoard callback that saves logs to a directory, and pass it to our model's `fit()` function.
3. Visualize our training logs with the `%tensorboard` magic function (we'll do this after model training).
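
The log directory is usually given a timestamped name so that each training run gets its own folder. A small standard-library sketch of that naming scheme follows; `toy_run_dir` and the `"logs"` base directory are hypothetical and only illustrate the pattern.

```python
import datetime
import os

def toy_run_dir(base_dir, now=None):
    """Build a per-run directory name such as <base_dir>/20220819-202605."""
    now = now or datetime.datetime.now()
    # %S is zero-padded seconds; lowercase %s is not portable across platforms
    return os.path.join(base_dir, now.strftime("%Y%m%d-%H%M%S"))

example = toy_run_dir("logs", datetime.datetime(2022, 8, 19, 20, 26, 5))
print(os.path.basename(example))  # 20220819-202605
```

Because the names sort chronologically, TensorBoard lists the runs in the order they were started.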

In [None]:
import datetime


def create_tensorboard_callback():
  # Create a log directory for storing TensorBoard logs
  logdir = os.path.join("drive/MyDrive/Dog-Vision/logs",
                        # Make it so the logs get tracked whenever we run an experiment
                        datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  return tf.keras.callbacks.TensorBoard(logdir)

## Early stopping callback
Early stopping halts training once a chosen metric stops improving, which helps prevent overfitting. See https://keras.io/api/callbacks/early_stopping/ for details.
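
The patience rule can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea, not Keras's implementation (which has further options such as `min_delta` and `restore_best_weights`); `epochs_until_stop` and the score history are made up for the example.

```python
def epochs_until_stop(val_scores, patience=3):
    """Return how many epochs run before a patience-based stop is triggered."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            # New best score: remember it and reset the counter
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_scores)

history = [0.50, 0.62, 0.66, 0.65, 0.66, 0.64, 0.70]
print(epochs_until_stop(history))  # 6
```

In this toy history the best score is reached at epoch 3, the next three epochs fail to beat it, and training stops at epoch 6 without ever seeing the later 0.70.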

In [None]:

early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_Accuracy",
                                                  patience=3) # stops after 3 rounds of no improvements

## Training a model

In [None]:
NUM_EPHOCS=100 #@param{type:"slider",min:10,max:100,step:10}

# Create a function that trains a model

* Create a model using `create_model()`
* Create a TensorBoard callback using `create_tensorboard_callback()`
* Call the `fit()` function on the model, passing it the training data, validation data, number of epochs and the callbacks
* Return the model

In [None]:
def train_model():
  """
  Creates a model, trains it on the training data and returns the trained version.
  """
  ## create model
  model=create_model()
  ## Create new tensorboard session every time we train a new model
  tensorboard=create_tensorboard_callback()
  # Fit model to data passing it callbacks we created
  model.fit(x=train_data,
            epochs=NUM_EPHOCS,
            validation_data=valid_data,
            validation_freq=1,
            callbacks=[tensorboard,early_stopping])
  return model

In [None]:
model=train_model()

> Overfitting at this stage is a good sign: it means our model is learning patterns from the training data. We can work on reducing it later.

Now our model has been trained, we can make its performance visual by checking the TensorBoard logs.

The TensorBoard magic function (%tensorboard) will access the logs directory we created earlier and visualize its contents.

In [None]:
%tensorboard --logdir drive/MyDrive/Dog-Vision/logs

## Make predictions

In [None]:
predictions=model.predict(valid_data,verbose=1)

In [None]:
predictions.shape

In [None]:
predictions[0]

In [None]:
len(predictions[0])

In [None]:
# Inspect a single prediction
index=42
print(predictions[index])
print(f"Max value (probability of prediction): {np.max(predictions[index])}") # the max probability value predicted by the model
print(f"Sum: {np.sum(predictions[index])}") # because we used softmax activation in our model, this will be close to 1
print(f"Max index: {np.argmax(predictions[index])}") # the index of where the max value in predictions[index] occurs
print(f"Predicted label: {unique_breeds_list[np.argmax(predictions[index])]}") # the predicted label

## Visualizing the images on which predictions are being made
> Note: prediction probabilities are also known as confidence levels (not to be confused with statistical confidence intervals).
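
To see why the prediction probabilities for one image sum to 1 and how the largest one becomes the predicted label, here is a small NumPy sketch of softmax. The raw scores and `toy_breeds` are invented for illustration and are not outputs of our model.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

toy_breeds = np.array(["beagle", "collie", "pug"])
probs = softmax(np.array([2.0, 0.5, -1.0]))

print(probs.sum())                   # very close to 1
print(toy_breeds[np.argmax(probs)])  # beagle, the highest-scoring breed
print(probs.max())                   # the model's confidence in that label
```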

In [None]:
def get_pred_label(prediction_probabilities):
  """
  Turns an array of prediction probabilities into a label.
  """
  return unique_breeds_list[np.argmax(prediction_probabilities)]

# Get a predicted label based on an array of prediction probabilities
pred_label = get_pred_label(predictions[0])
pred_label

Since our validation data is still in batch format, we need to unbatch it before we can compare images and labels with predictions. To do that, we'll create a function that unbatches a batched dataset.
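
Conceptually, unbatching just walks through every batch and collects the individual samples. The NumPy sketch below does this on two fake batches; `fake_batches` and `toy_breeds` are placeholders standing in for a real `tf.data` dataset and the breed list.

```python
import numpy as np

toy_breeds = np.array(["beagle", "pug"])

# Two fake batches of (images, one-hot labels): sizes 2 and 1
fake_batches = [
    (np.zeros((2, 4, 4, 3)), np.array([[True, False], [False, True]])),
    (np.zeros((1, 4, 4, 3)), np.array([[True, False]])),
]

flat_images, flat_labels = [], []
for image_batch, label_batch in fake_batches:
    for image, label in zip(image_batch, label_batch):
        flat_images.append(image)
        # Turn the one-hot label back into a breed name
        flat_labels.append(str(toy_breeds[np.argmax(label)]))

print(len(flat_images), flat_labels)  # 3 ['beagle', 'pug', 'beagle']
```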

In [None]:
# Create a function to unbatch a batched dataset
def unbatchify(data):
  """
  Takes a batched dataset of (image, label) Tensors and returns separate arrays
  of images and labels.
  """
  images = []
  labels = []
  # Loop through unbatched data
  for image, label in data.unbatch().as_numpy_iterator():
    images.append(image)
    labels.append(unique_breeds_list[np.argmax(label)])
  return images, labels

# Unbatchify the validation data
val_images, val_labels = unbatchify(valid_data)
val_images[0], val_labels[0]

In [None]:
val_labels[0] # the true label is already a breed name after unbatchify, so get_pred_label isn't needed

In [None]:
get_pred_label(predictions[0])

Now we've got ways to get:
* Prediction labels
* Validation labels (truth labels)
* Validation images

Let's create a function to visualize them. The function will:
* Take an array of prediction probabilities, an array of truth labels, an array of images and an integer.
* Convert the prediction probabilities into a predicted label.
* Plot the predicted label, its predicted probability, the truth label and the target image on a single plot.

In [None]:
def plot_pred(prediction_probabilities, labels, images, n=1):
  """
  View the prediction, ground truth label and image for sample n.
  """
  pred_prob, true_label, image = prediction_probabilities[n], labels[n], images[n]
  
  # Get the pred label
  pred_label = get_pred_label(pred_prob)
  
  # Plot image & remove ticks
  plt.imshow(image)
  plt.xticks([])
  plt.yticks([])

  # Change the color of the title depending on if the prediction is right or wrong?
  if pred_label == true_label:
    color = "green"
  else:
    color = "red"

  plt.title("{} {:2.0f}% ({})".format(pred_label,
                                      np.max(pred_prob)*100,
                                      true_label),
                                      color=color)

In [None]:
plot_pred(prediction_probabilities=predictions,
          labels=val_labels,
          images=val_images,
          n=77)

In [None]:
def save_model(model, suffix=None):
  """
  Saves a given model in a models directory and appends a suffix (string).
  """
  # Create a model directory pathname with current time
  modeldir = os.path.join("drive/MyDrive/Dog-Vision/models",
                          datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
  model_path = modeldir + "-" + suffix + ".h5" # save format of model
  print(f"Saving model to: {model_path}...")
  model.save(model_path)
  return model_path

In [None]:
# Create a function to load a trained model
def load_model(model_path):
  """
  Loads a saved model from a specified path.
  """
  print(f"Loading saved model from: {model_path}")
  model = tf.keras.models.load_model(model_path, 
                                     custom_objects={"KerasLayer":hub.KerasLayer})
  return model

In [None]:
save_model(model,suffix="1000-image-adam-mobilenetv2-model")

In [None]:
loaded_1000_image_model=load_model("/content/drive/MyDrive/Dog-Vision/models/20220819-20261660940784-1000-image-adam-mobilenetv2-model.h5")

In [None]:
model.evaluate(valid_data)

In [None]:
loaded_1000_image_model.evaluate(valid_data)

## Training a big dog model (on the full data)

In [None]:
len(x),len(y)

In [None]:
full_data=create_data_batches(x,y)

In [None]:
full_data

In [None]:
full_model=create_model()

In [None]:
full_model_tensorbaord=create_tensorboard_callback()


In [None]:
full_model_early_stopping = tf.keras.callbacks.EarlyStopping(monitor="Accuracy",
                                                  patience=3)

In [None]:
# Fit the full model (not the earlier 1000-image model) to the full data
full_model.fit(x=full_data,
               epochs=NUM_EPHOCS,
               callbacks=[full_model_tensorbaord,full_model_early_stopping])

In [None]:
save_model(full_model,suffix="full_imageset_mobilnetv2_adam")

In [None]:
loaded_model_full=load_model("drive/MyDrive/Dog-Vision/models/20220820-18031661018618-full_imageset_mobilnetv2_adam.h5")

## Preprocessing test data

#### Making predictions on the test dataset
Since our model has been trained on images in the form of Tensor batches, to make predictions on the test data, we'll have to get it into the same format.

Luckily we created create_data_batches() earlier which can take a list of filenames as input and convert them into Tensor batches.

To make predictions on the test data, we'll:

1. Get the test image filenames. ✅
2. Convert the filenames into test data batches using create_data_batches() and setting the test_data parameter to True (since the test data doesn't have labels). ✅
3. Make a predictions array by passing the test batches to the predict() method called on our model.

In [None]:
test_filenames=["drive/MyDrive/Dog-Vision/test/" + fname for fname in os.listdir("drive/MyDrive/Dog-Vision/test")]

In [None]:
test_filenames

#### Create test data batches

In [None]:

test_data= create_data_batches(test_filenames,test_data=True)

In [None]:
test_data

In [None]:
test_predictios=loaded_model_full.predict(test_data,verbose=1)

In [None]:
np.savetxt("drive/MyDrive/Dog-Vision/preds-array.csv",test_predictios,delimiter=",")

In [None]:
np.loadtxt("drive/MyDrive/Dog-Vision/preds-array.csv",delimiter=",")

In [None]:
test_predictios.shape

## Preparing test dataset predictions for Kaggle
Looking at the Kaggle sample submission, we find that it wants our model's prediction probability outputs in a DataFrame with an ID and a column for each different dog breed. https://www.kaggle.com/c/dog-breed-identification/overview/evaluation

To get the data in this format, we'll:

1. Create a pandas DataFrame with an ID column as well as a column for each dog breed. ✅
2. Add data to the ID column by extracting the test image IDs from their filepaths.
3. Add data (the prediction probabilities) to each of the dog breed columns.
4. Export the DataFrame as a CSV to submit it to Kaggle.
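
The steps above can be sketched on toy data before running them on the real predictions. The breed names, filenames and probabilities below are all invented for illustration; only the layout (an `id` column followed by one column per breed) mirrors what the submission needs.

```python
import os
import numpy as np
import pandas as pd

toy_breeds = ["beagle", "collie", "pug"]
toy_test_files = ["aaa111.jpg", "bbb222.jpg"]
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8]])

# One column per breed, one row per test image
submission = pd.DataFrame(toy_probs, columns=toy_breeds)
# The ID is the filename without its extension; put it in the first column
submission.insert(0, "id", [os.path.splitext(name)[0] for name in toy_test_files])
print(submission)
```

Calling `submission.to_csv(..., index=False)` on a frame like this writes the header row and one row per image, without the pandas index.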

In [None]:
preds_df=pd.DataFrame(columns=["id"]+list(unique_breeds_list))

In [None]:
preds_df.head()

In [None]:
test_path="drive/MyDrive/Dog-Vision/test/"

In [None]:
test_ids= [os.path.splitext(path)[0] for path in os.listdir(test_path)]

In [None]:
preds_df["id"]=test_ids

In [None]:
preds_df.tail()

In [None]:
preds_df[list(unique_breeds_list)] = test_predictios

In [None]:
preds_df

In [None]:
preds_df.to_csv("drive/MyDrive/Dog-Vision/full_model_submission_for_model_2.csv",index=False)