
## Lab-10: Cat Dog Classification with CNNs


In [6]:
#Importing libraries
import matplotlib.pyplot as plt
import numpy as np
import os
import shutil
import sys
from glob import glob
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense, Input
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

## Part 1: Importing the dataset from Kaggle

Before we start, let's get some basic setup steps out of the way: <br>
<h5> 1. Create a Kaggle account</h5>

- Create an account on Kaggle.com. This is mandatory, since we will be accessing the dataset from Kaggle directly in Google Colab.
- Next, go to https://www.kaggle.com/c/dogs-vs-cats/data. This is the dataset we will be using for this lab.

<h5> 2. Get the dataset to Google colab</h5>

- On Kaggle, open the menu in the top right corner of the page [Profile, **Account**, Sign Out], click **Account**, and scroll down to the API section.
- There, create a new API token. You should get an option to download a **kaggle.json** file.

In [7]:
#Install kaggle
! pip install -q kaggle

In [None]:
#Upload the kaggle.json file you just downloaded
from google.colab import files
files.upload()


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#Make directory named kaggle 
!mkdir -p ~/.kaggle

In [None]:
# copy kaggle.json file there.
! cp kaggle.json ~/.kaggle/

In [None]:
#change the permissions of the file
! chmod 600 ~/.kaggle/kaggle.json 
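
The three shell cells above can also be expressed in plain Python, which may be handy outside Colab. The following is a minimal sketch, not part of the Kaggle tooling: `install_kaggle_token` is a hypothetical helper name, and the optional `home_dir` parameter is an assumption added only so the function can be tried against a scratch directory.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copy a Kaggle API token into <home>/.kaggle with owner-only permissions.

    Hypothetical helper for illustration; it is not part of the kaggle package.
    """
    home = Path(home_dir) if home_dir is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)   # like: mkdir -p ~/.kaggle
    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(token_path, target)             # like: cp kaggle.json ~/.kaggle/
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)   # like: chmod 600 ~/.kaggle/kaggle.json
    return target
```

Calling `install_kaggle_token('kaggle.json')` after the upload step would place the token where the `kaggle` command looks for it by default.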

In [None]:
# The dataset zip is downloaded into DATA_DIR; the images are unpacked into IMAGE_DIR
DATA_DIR = '../data'
IMAGE_DIR = '../data/images'
!mkdir -p ../data
!mkdir -p ../data/dogs-vs-cats
!mkdir -p ../data/images

In [None]:
#downloading the dataset from Kaggle
!kaggle competitions download -c dogs-vs-cats -p {DATA_DIR}

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.8/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.8/dist-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


Unzip the file and delete the other unnecessary files and original zip.

In [None]:
# Unpack the competition archive, then delete it
shutil.unpack_archive(os.path.join(DATA_DIR,'dogs-vs-cats.zip'), DATA_DIR)
os.remove(os.path.join(DATA_DIR, 'dogs-vs-cats.zip'))
# KAGGLE_DIR will hold the train/validation/test split created below
KAGGLE_DIR = os.path.join(DATA_DIR, 'dogs-vs-cats')
# Unpack the inner image archives into IMAGE_DIR, then delete them
shutil.unpack_archive(os.path.join(DATA_DIR, 'train.zip'), IMAGE_DIR)
os.remove(os.path.join(DATA_DIR, 'train.zip'))
shutil.unpack_archive(os.path.join(DATA_DIR, 'test1.zip'), IMAGE_DIR)
os.remove(os.path.join(DATA_DIR, 'test1.zip'))

# The sample submission file is not needed for this lab
os.remove(os.path.join(DATA_DIR, 'sampleSubmission.csv'))


The dogs and cats are all mixed in a single directory. The label is in the file name itself.<br>
We need to create:
1. Train, validation, and test directories, each containing a subset of the images.
2. Separate cat and dog directories _within_ train, validation, and test.

Number 2 is necessary because the Keras ImageDataGenerator's flow_from_directory() method infers the class label from the subdirectory the image resides in.

Therefore, we need to create a directory structure as seen below:


In [None]:
# dogs-vs-cats
# ├── test
# │   ├── cats
# │   └── dogs
# ├── train
# │   ├── cats
# │   └── dogs
# └── validation
#     ├── cats
#     └── dogs
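
To make the label inference concrete, here is a small sketch of how class indices can be derived from subdirectory names. It only mimics the idea with the standard library: `infer_class_indices` is a hypothetical helper, the directory tree is a throwaway one created for the demo, and the sorted-name ordering is assumed to match what `flow_from_directory` reports in `class_indices`.

```python
import os
import tempfile


def infer_class_indices(split_dir):
    """Map each subdirectory name to an integer index, in sorted order."""
    classes = sorted(
        name for name in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, name))
    )
    return {name: index for index, name in enumerate(classes)}


# Build a throwaway tree that mirrors one split of the layout above.
demo_root = tempfile.mkdtemp()
for label in ["dogs", "cats"]:
    os.makedirs(os.path.join(demo_root, "train", label))

print(infer_class_indices(os.path.join(demo_root, "train")))  # {'cats': 0, 'dogs': 1}
```

Because the mapping depends only on folder names, a misspelled or extra subdirectory would silently become its own class, which is why the layout has to be exact.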

In [None]:
# Create train, validation, and test directories
split_dirs = ['train', 'validation', 'test']
for split_dir in split_dirs:
    # create label subdirectories
    label_dirs = ['dogs', 'cats']
    for label_dir in label_dirs:
        new_dir = os.path.join(KAGGLE_DIR, split_dir, label_dir)
        os.makedirs(new_dir, exist_ok=True)

In [None]:
# Keep a random PERCENT_OF_DATA of the labeled images and move each kept image
# into train/validation/test according to the probability distribution 'p'
PERCENT_OF_DATA = 0.1
np.random.seed(42)
for folder in os.listdir(IMAGE_DIR):  # unpacked archives, e.g. 'train' and 'test1'
    for file in os.listdir(os.path.join(IMAGE_DIR, folder)):
        if not file.endswith('.jpg'):
            continue  # skip over non-image files
        if file.startswith('cat'):
            label_dir = 'cats'
        elif file.startswith('dog'):
            label_dir = 'dogs'
        else:
            continue  # no label in the file name, so the image cannot be used here
        src = os.path.join(IMAGE_DIR, folder, file)
        if np.random.uniform() > PERCENT_OF_DATA:
            os.remove(src)
            continue
        dst_dir = np.random.choice(['train', 'validation', 'test'], p=[.5, .25, .25])
        dst = os.path.join(KAGGLE_DIR, dst_dir, label_dir, file)
        try:
            shutil.move(src, dst)
        except Exception as e:
            print(e)
# Remove the unpacked image directories along with any leftover files
shutil.rmtree(IMAGE_DIR)
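
Since the cell above deletes and moves files, it can help to preview the split first. The following is a dry-run sketch under stated assumptions: `plan_split` is a hypothetical helper, the file names are made up to follow the dataset's `<label>.<id>.jpg` pattern, and it uses the standard-library `random` module rather than NumPy, so its draws will not match the cell above even with the same seed.

```python
import random
from collections import Counter


def plan_split(filenames, keep_fraction=0.1, weights=(0.5, 0.25, 0.25), seed=42):
    """Return {filename: (split, label_dir)} for the files that would be kept."""
    rng = random.Random(seed)
    splits = ["train", "validation", "test"]
    plan = {}
    for name in filenames:
        if name.startswith("cat"):
            label_dir = "cats"
        elif name.startswith("dog"):
            label_dir = "dogs"
        else:
            continue  # no label in the file name
        if rng.random() > keep_fraction:
            continue  # dropped by the subsampling step
        split = rng.choices(splits, weights=weights, k=1)[0]
        plan[name] = (split, label_dir)
    return plan


# Hypothetical file names; nothing is read from or written to disk.
names = [f"{label}.{i}.jpg" for label in ("cat", "dog") for i in range(1000)]
names.append("1.jpg")  # an unlabeled file that should be ignored
plan = plan_split(names)
print(Counter(split for split, _ in plan.values()))
```

With `keep_fraction=0.1` roughly a tenth of the labeled files survive, and about half of those land in `train`.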

In [None]:
#Number of images in each subdir
for dir_name in split_dirs:
    for label_dir in label_dirs:
        print(dir_name ,label_dir, len(os.listdir(KAGGLE_DIR + '/' + dir_name + '/' + label_dir)))

In [None]:
# Preprocessing the image into a 4D tensor
img_path = glob(KAGGLE_DIR+'/*/*/*.jpg')[0]

img = image.load_img(img_path, target_size=(150, 150))
img_tensor = image.img_to_array(img)
img_tensor = np.expand_dims(img_tensor, axis=0)
img_tensor /= 255.

print(img_tensor.shape)

In [None]:
!ls

In [None]:
# Displaying an example img
plt.imshow(img_tensor[0])
plt.show()

## Part 2: Creating the Generators and training a model with only rescaled images

<b>Create the Generators</b>

Now that the data is in the correct directory structure, we can create the data generators.
Note the plural: we will have _multiple_ generators, one for each split directory.<br>


First we create a main data generator object, `datagen`. It accepts a wide range of arguments that control how the images it generates are preprocessed.<br>

<b>For right now we will only use the `rescale` argument to normalize all pixel values to between 0 and 1 (remember that 255 is the max pixel value).</b>

In [None]:
datagen = ImageDataGenerator(rescale=1/255)

Now we use `datagen`'s `flow_from_directory` method to create the 3 generators: `traingen`, `valgen`, and `testgen`.<br>

The function needs to be given the following parameters:<br>
- `directory` which they will use as their image source
- `target_size` to resize all images to (75,75)
- `batch_size`
- `class_mode` to instruct the generator on how to interpret the label folders. 

We should also set `shuffle=False` in the test generator so that it produces the same images in the same order every time it is used.

In [None]:
batch_size = 16
target_size = (75, 75)

# Each generator reads from its own split directory under KAGGLE_DIR
traingen = datagen.flow_from_directory(directory=os.path.join(KAGGLE_DIR, 'train'), target_size=target_size, batch_size=batch_size, class_mode='categorical')

valgen = datagen.flow_from_directory(directory=os.path.join(KAGGLE_DIR, 'validation'), target_size=target_size, batch_size=batch_size, class_mode='categorical')

# shuffle=False keeps the test images in the same order every time
testgen = datagen.flow_from_directory(directory=os.path.join(KAGGLE_DIR, 'test'), target_size=target_size, batch_size=batch_size, class_mode='categorical', shuffle=False)

In [None]:
print("Class indices:", traingen.class_indices)
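
With `class_mode='categorical'`, each label in a batch is a one-hot vector whose position comes from the class indices printed above. The sketch below shows that encoding in plain Python. The `one_hot` helper is hypothetical, and the mapping `{'cats': 0, 'dogs': 1}` is an assumption about what the cell above prints.

```python
def one_hot(label, class_indices):
    """Return a one-hot list for `label`, given a name -> index mapping."""
    vector = [0.0] * len(class_indices)
    vector[class_indices[label]] = 1.0
    return vector


class_indices = {"cats": 0, "dogs": 1}  # assumed mapping, check the printed output

print(one_hot("cats", class_indices))  # [1.0, 0.0]
print(one_hot("dogs", class_indices))  # [0.0, 1.0]
```

This two-element encoding is what a two-unit softmax output layer trained with `categorical_crossentropy` expects.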


### Construct CNN MODEL

- Build the CNN model. There is no limit on the number of layers or the size of the model; we leave the design choices to you. For more information on layers, see [CNN modelling](https://keras.io/api/layers/convolution_layers/convolution2d/).
- Fit the model using `Model.fit()`.
- Evaluate your model.
- Plot your results.
- Save your model.

We would love to see these results in TensorBoard along with the computation graph.

You can regularize the model as well. For reference: https://keras.io/api/callbacks/
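
One common callback-based form of regularization is early stopping: training halts once the validation loss has stopped improving for a set number of epochs. The sketch below illustrates only that patience logic in plain Python. It is not the Keras implementation, `early_stopping_epoch` is a hypothetical helper, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    stop_epoch is None if training would run through every recorded epoch.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.50]  # made-up validation losses
print(early_stopping_epoch(losses, patience=2))  # (4, 2)
```

Note that the run stops at epoch 4 and never sees the later drop to 0.50, which is the trade-off a small patience value makes.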

In [None]:
#Creating a CNN
CNN = Sequential()

CNN.add(Input(shape=(75, 75, 3)))

#Specify a list of the number of filters for each convolutional layer

for n_filters in [16, 32, 64]:
    CNN.add(Conv2D(n_filters,strides=(2, 2), kernel_size=3, activation='relu'))

# Fill in the layer needed between our 2d convolutional layers and the dense layer
CNN.add(Flatten())

#Specify the number of nodes in the dense layer before the output
CNN.add(Dense(128, activation='relu'))

#Specify the output layer
CNN.add(Dense(2, activation='softmax'))
 
#Compiling the model
CNN.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
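
It is worth checking by hand how the strided convolutions shrink the image before `Flatten`. The sketch below reproduces that arithmetic for the model above, assuming the default unpadded (`'valid'`) convolution, for which the output size along one axis is `(size - kernel_size) // stride + 1`. The `conv_output_size` helper is a hypothetical name.

```python
def conv_output_size(size, kernel_size, stride):
    """Output length along one spatial axis for an unpadded convolution."""
    return (size - kernel_size) // stride + 1


size, channels = 75, 3  # input images are 75 x 75 RGB
total_params = 0
for n_filters in [16, 32, 64]:
    # 3x3 kernel over all input channels, plus one bias per filter
    total_params += (3 * 3 * channels + 1) * n_filters
    size = conv_output_size(size, kernel_size=3, stride=2)
    channels = n_filters
    print(f"Conv2D({n_filters}): {size} x {size} x {channels}")

flat = size * size * channels
total_params += (flat + 1) * 128  # Dense(128): weights plus biases
total_params += (128 + 1) * 2     # Dense(2): weights plus biases
print("Flatten:", flat)              # 4096
print("Parameters:", total_params)   # 548258
```

Almost all of the parameters sit in the first dense layer, so changing the strides or adding pooling has a large effect on model size.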





**Plot model Diagram**

In [None]:
from tensorflow.keras.utils import plot_model
plot_model(CNN, show_shapes=True, show_layer_names=False, dpi=60)

<b>Fit Model</b>

Let’s fit the model to the data using the generator. You can use `fit` as before but this time you will pass it generators rather than dataframes or numpy arrays.  

Because the data is being generated endlessly, the Keras model needs to know how many batches to draw from the generator before declaring an epoch over. This is the role of the `steps_per_epoch` argument: after drawing `steps_per_epoch` batches from the generator (that is, after running `steps_per_epoch` gradient descent steps), the fitting process moves on to the next epoch. The generators returned by `flow_from_directory` report their own length, so when `steps_per_epoch` is omitted Keras runs through each generator once per epoch.

When using `fit`, you can pass a validation_data argument. It’s important to note that this argument is allowed to be a data generator, but it could also be a tuple of Numpy arrays. If you pass a generator as validation_data, then this generator is expected to yield batches of validation data endlessly; thus you should also specify the validation_steps argument, which tells the process how many batches to draw from the validation generator for evaluation.
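
If you do pass these arguments explicitly, a common choice is the number of batches needed to see every image once. The sketch below shows that calculation. The `steps_for` helper is hypothetical, and the sample counts are placeholders rather than the actual sizes of the split directories.

```python
import math


def steps_for(n_samples, batch_size):
    """Number of batches needed to cover n_samples once (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


batch_size = 16
n_train, n_val = 1250, 625  # placeholder counts, not measured from the dataset

print(steps_for(n_train, batch_size))  # 79
print(steps_for(n_val, batch_size))    # 40
```

These values would be passed to `fit` as `steps_per_epoch` and `validation_steps` respectively.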

In [None]:
# Training the CNN model
history = CNN.fit(traingen,
        epochs=20,
        validation_data=valgen)

<b>Evaluate the Model</b>

In [None]:
CNN.evaluate(testgen)

<b>Plot the Training History</b>

Let’s plot the loss and accuracy of the model on the training and validation data over the course of training:

In [None]:
# Plotting the loss and accuracy plots
fig, ax = plt.subplots(1,2,figsize=(10,5))
ax[0].plot(history.history['loss'], label="Train Loss")
ax[0].plot(history.history['val_loss'], label="Val Loss")
ax[1].plot(history.history['accuracy'], label="Train Accuracy")
ax[1].plot(history.history['val_accuracy'], label="Val Accuracy")
ax[0].legend()
ax[1].legend()
ax[0].set_title("Loss Plot")
ax[1].set_title("Accuracy Plot")
for axis in ax:
    axis.set_xlabel("Epoch")
plt.show()