# Dogs vs. Cats: Training a Convnet from Scratch

In this notebook, I will explore the process of creating an image classification convolutional neural network (convnet) from scratch.

Let's build the model with a series of `Conv2D` and `MaxPooling2D` layers. 

The dataset is the [dogs vs cats](https://www.kaggle.com/c/dogs-vs-cats) dataset from the 2013 Kaggle competition. Although the training set contains 25,000 images, let's start with only a small fraction of them to keep computing costs down for now.

In [2]:
import os
import shutil
import pathlib
from tensorflow import keras
import numpy as np 
import matplotlib.pyplot as plt 

## Preparing the Dataset

First, download the dataset from Kaggle using the `kaggle` command-line tool in a terminal:

```
kaggle competitions download -c dogs-vs-cats
unzip dogs-vs-cats.zip -d dogs-vs-cats-data
cd dogs-vs-cats-data
unzip -q train.zip
unzip -q test1.zip
rm train.zip
rm test1.zip
cd ..
rm dogs-vs-cats.zip
```

Reminder: Make sure to accept the competition rules on Kaggle before downloading. Otherwise, you will get a `403 - Forbidden` error.

### Preview some of the images

To get an idea of what we are working with, let's preview a few of the images.

### Dogs
![Dog](dogs-vs-cats-data/train/dog.0.jpg)
![Dog](dogs-vs-cats-data/train/dog.1.jpg)
![Dog](dogs-vs-cats-data/train/dog.2.jpg)
![Dog](dogs-vs-cats-data/train/dog.3.jpg)

### Cats
![Cat](dogs-vs-cats-data/train/cat.0.jpg)
![Cat](dogs-vs-cats-data/train/cat.1.jpg)
![Cat](dogs-vs-cats-data/train/cat.2.jpg)
![Cat](dogs-vs-cats-data/train/cat.3.jpg)



Previewing the images shows that they are in color but are not all the same size. This means we will need a way to handle the different sizes before feeding the images to our model.
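To go beyond eyeballing a handful of previews, the spread of image sizes can be tallied directly. The following is a minimal sketch, assuming Pillow is installed (matplotlib depends on it). The helper name `count_image_sizes` and its `limit` parameter are illustrative and not part of the original notebook.

```python
import pathlib
from collections import Counter

from PIL import Image  # Pillow; assumed to be installed


def count_image_sizes(image_dir, pattern="*.jpg", limit=200):
    """Count how many images have each (width, height) in a sample of a folder.

    Hypothetical helper for illustration. Only the first `limit` files
    (sorted by name) are opened, to keep the check fast.
    """
    sizes = Counter()
    paths = sorted(pathlib.Path(image_dir).glob(pattern))[:limit]
    for path in paths:
        # Image.open reads the header lazily, so this does not decode pixels.
        with Image.open(path) as img:
            sizes[img.size] += 1  # img.size is (width, height)
    return sizes
```

Calling `count_image_sizes("dogs-vs-cats-data/train").most_common(5)` would list the most frequent sizes in the sample. If more than one size shows up, the images need to be resized to a common shape before they reach the model.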

### Make a small dataset

Instead of the full 25,000 training images, let's use a small subset of that to practice on. 

This will also allow us to create separate training, validation, and testing sets. The original test dataset does not have labels, so let's draw all three sets from the images in the training dataset.

Unfortunately, since the Kaggle competition is over and there is no way to submit solutions, the original test dataset is of no use to us: without labels, we cannot measure accuracy on it (short of labeling the images by hand).
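One way to carve three non-overlapping sets out of the labeled training images is to give each set its own consecutive range of file indices within each class. The sketch below shows the arithmetic, using the subset sizes chosen in the next cell. The function name `split_ranges` is hypothetical and only illustrates the idea.

```python
def split_ranges(subset_sizes):
    """Map each subset name to the (start, end) file-index range used per class.

    Each subset is half cats and half dogs, so a subset of size n covers
    n // 2 consecutive file indices in each class.
    """
    ranges = {}
    start = 0
    for name, size in subset_sizes.items():
        end = start + size // 2
        ranges[name] = (start, end)  # end is exclusive, as in range()
        start = end
    return ranges


ranges = split_ranges({"train": 2000, "validation": 1000, "test": 2000})
print(ranges)
# {'train': (0, 1000), 'validation': (1000, 1500), 'test': (1500, 2500)}
```

Because each range starts where the previous one ends, no image lands in more than one set.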

In [10]:
# Our full data set is in the original data folder. Let's make a new small folder.
original_dir = pathlib.Path('dogs-vs-cats-data')
new_base_dir = pathlib.Path('dogs-vs-cats-small')

# Define a function to copy a range of images for each category into the new directory.
def make_subset(subset_name, start_index, end_index):
    for category in ('cat', 'dog'):
        # Named subset_dir to avoid shadowing the built-in dir().
        subset_dir = new_base_dir / subset_name / category
        subset_dir.mkdir(parents=True, exist_ok=True)
        # Images in the train set are named like dog.5.jpg, with no leading zeros.
        fnames = [f'{category}.{i}.jpg' for i in range(start_index, end_index)]
        for fname in fnames:
            # Copy from original_dir to new_base_dir
            shutil.copyfile(src=original_dir / 'train' / fname, dst=subset_dir / fname)

# Define the total size of each set.
# Half of each set is cats and half is dogs, so each category contributes size // 2 images.
num_train_images = 2000
num_val_images = 1000
num_test_images = 2000

# Loop over the train, validation, and test sets, copying a non-overlapping index range for each.
i = 0
for dataset, num_images in {"train": num_train_images, "validation": num_val_images, "test": num_test_images}.items():
    per_category = num_images // 2
    make_subset(dataset, i, i + per_category)
    i += per_category
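
After the copy, it is worth confirming that each folder holds the expected number of files. The following is a minimal sketch of such a check. The helper name `count_subset_images` is an assumption for illustration and not part of the original notebook.

```python
import pathlib


def count_subset_images(base_dir,
                        subsets=("train", "validation", "test"),
                        categories=("cat", "dog")):
    """Return {(subset, category): number of .jpg files} under base_dir.

    Hypothetical helper; folders that do not exist are reported as 0.
    """
    counts = {}
    for subset in subsets:
        for category in categories:
            folder = pathlib.Path(base_dir) / subset / category
            if folder.is_dir():
                counts[(subset, category)] = len(list(folder.glob("*.jpg")))
            else:
                counts[(subset, category)] = 0
    return counts
```

With the sizes above, `count_subset_images("dogs-vs-cats-small")` should report 1000 images per category for train, 500 for validation, and 1000 for test.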