<a href="https://colab.research.google.com/github/MahdiJahanbakht/cat-dog-classification-on-colab/blob/master/2_Make_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Make A Decent Dataset for our work

In this project we use the classic **Cat-Dog Classification** data. In this notebook, I will show how to prepare the images so that we can load them and train our neural network. Fortunately, the dataset is available through the Kaggle API.
We'll use this dataset to train a neural network that can differentiate between cats and dogs.

Since we are going to use PyTorch to train our network, the easiest way to load image data is with **datasets.ImageFolder** from **[torchvision](http://pytorch.org/docs/master/torchvision/datasets.html#imagefolder)**.  
We use ImageFolder like this:
`dataset = datasets.ImageFolder('path/to/data', transform=transform)`
where `'path/to/data'` is the file path to the data directory and `transform` is a sequence of processing steps built with the **transforms** module from torchvision. ImageFolder expects the training and test files to be arranged in directories like this:

```
root/train/class name/xxx.png
root/test/class name/xxx.png
```
So, in our example we should have something like this:
```
root/train/cat/xxx.png
root/train/dog/xxx.png

root/test/cat/xxx.png
root/test/dog/xxx.png
```
As illustrated above, each class (cat and dog) has its own directory for its images. The images are then labeled with the class taken from the directory name. Unfortunately, the original dataset from Kaggle is not organized like this, so we have to rearrange it ourselves. That is what this notebook does.
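
To make the labeling rule concrete, here is a minimal standard-library sketch of how integer labels can be derived from the directory names. It mirrors the way `ImageFolder` sorts the class folders alphabetically and numbers them from zero. The helper name `class_to_idx_from_dir` and the temporary directory are illustrative assumptions, not part of torchvision.

```python
import os
import tempfile


def class_to_idx_from_dir(split_dir):
    """Map each class sub-directory of split_dir to an integer label."""
    classes = sorted(
        entry.name for entry in os.scandir(split_dir) if entry.is_dir()
    )
    return {name: idx for idx, name in enumerate(classes)}


# Build a tiny throwaway layout: <tmp>/train/cat and <tmp>/train/dog
with tempfile.TemporaryDirectory() as root:
    for cls in ("dog", "cat"):
        os.makedirs(os.path.join(root, "train", cls))
    mapping = class_to_idx_from_dir(os.path.join(root, "train"))

print(mapping)  # {'cat': 0, 'dog': 1}
```

With this layout, every image under `cat/` would get label 0 and every image under `dog/` would get label 1.
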
First, we have to mount our **Google Drive**.

If this is your first time running this code, do the following:
* Run the block.
* A link will appear in the output. Click on it.
* Choose your account and, in the next window, grant **Google Drive File Stream** the access it requests :)
* On the next page, copy the verification code, paste it into the box that has appeared below the link in the output, and press **Enter**.

This mounts your drive at `/content/drive`.

### Mount Google Drive

In [0]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Second, we download the **biaiscience/dogs-vs-cats** dataset into a folder inside our drive; the `path` argument sets that folder.
`/content/drive/My Drive/` is the root directory of our mounted drive and `Deep_Learning` is the folder where the downloaded dataset is saved.

### Download data

In [0]:
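# Equivalent CLI command: kaggle datasets download -d biaiscience/dogs-vs-cats

import kaggle

# Authenticate with your Kaggle API credentials
kaggle.api.authenticate()

# Download the dataset as a zip file into the Drive folder, without unzipping it
kaggle.api.dataset_download_files('biaiscience/dogs-vs-cats', path='/content/drive/My Drive/Deep_Learning/', unzip=False)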
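
Authentication fails if the Kaggle client cannot find any credentials, so it can help to check for them first. The sketch below assumes the two usual places the client reads from: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` token file in `~/.kaggle/`. The helper name `has_kaggle_credentials` is hypothetical and not part of the `kaggle` package.

```python
import os


def has_kaggle_credentials(home=None, environ=None):
    """Return True if Kaggle API credentials appear to be configured."""
    environ = os.environ if environ is None else environ
    home = os.path.expanduser("~") if home is None else home
    # Option 1: credentials supplied through environment variables
    if environ.get("KAGGLE_USERNAME") and environ.get("KAGGLE_KEY"):
        return True
    # Option 2: the API token file downloaded from the Kaggle account page
    return os.path.isfile(os.path.join(home, ".kaggle", "kaggle.json"))


if not has_kaggle_credentials():
    print("No Kaggle credentials found; place kaggle.json in ~/.kaggle/ first.")
```
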

The structure of the zip file is as follows:
```
dogs-vs-cats.zip
  - train.zip
  - test1.zip
  - sampleSubmission.csv
```
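
The layout above can be checked without extracting anything by listing the archive members. This is a minimal sketch using `zipfile.ZipFile.namelist()`; the helper name `list_archive` is illustrative, and the in-memory archive is only a stand-in for the real `dogs-vs-cats.zip` so that the example runs anywhere.

```python
import io
import zipfile


def list_archive(zip_source):
    """Return the member names of a zip archive (path or file-like object)."""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.namelist()


# Stand-in for dogs-vs-cats.zip, built in memory with empty members
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as outer:
    for name in ("train.zip", "test1.zip", "sampleSubmission.csv"):
        outer.writestr(name, b"")
buffer.seek(0)

print(list_archive(buffer))
```

On Colab, one would pass the path of the downloaded `dogs-vs-cats.zip` on the mounted drive instead of the in-memory buffer.
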
Inside `train.zip` there are many files named like `cat.xxxx.jpg` or `dog.xxxx.jpg`, and we are going to rearrange them into:
```
Cat_Dog_data
  -train
    -cat
      cat.xxxx.jpg
    -dog
      dog.xxxx.jpg
  -test
    -cat
      cat.xxxx.jpg
    -dog
      dog.xxxx.jpg
```
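
The rearrangement relies on the class being encoded in the file name. As a small sketch of that idea, the hypothetical helper `label_from_filename` below takes the part of the name before the first dot and accepts it only if it is `cat` or `dog`:

```python
def label_from_filename(filename):
    """Return 'cat' or 'dog' for names like 'cat.123.jpg', else None."""
    prefix = filename.split(".", 1)[0].lower()
    return prefix if prefix in ("cat", "dog") else None


for name in ("cat.0.jpg", "Dog.12.jpg", "1234.jpg"):
    print(name, "->", label_from_filename(name))
```

A name without a class prefix, such as `1234.jpg`, yields `None`, so it cannot be placed in a class folder.
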
The other zip file, `test1.zip`, contains files named like `xxxx.jpg`. These names carry no class information, so the files are of no use to us. We simply ignore them and build our test set from the training data instead.
Running the next block unzips the data we need from the downloaded dataset.

### Unzip dataset
>**Caution:** Google Drive operations can **time out** when the **number of files or subfolders in a folder grows too large**. If thousands of items are directly contained in the top-level **"My Drive"** folder then mounting the drive will likely time out. Repeated attempts may eventually succeed as failed attempts cache partial state locally before timing out. If you encounter this problem, try moving files and folders directly contained in **"My Drive"** into sub-folders. A similar problem can occur when reading from other folders after a successful `drive.mount()`. Accessing items in any folder containing many items can cause errors like `OSError: [Errno 5] Input/output error`. Again, you can fix this problem by moving directly contained items into sub-folders.  
>**Note** that "deleting" files or subfolders by moving them to the Trash **may not be enough**; if that doesn't seem to help, make sure to also **Empty your Trash**. [Link](https://research.google.com/colaboratory/faq.html#drive-timeout)
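
Since repeated attempts may eventually succeed, one possible workaround is to retry a failing directory read a few times. The sketch below is only an illustration of that idea: the helper name `listdir_with_retry`, the number of attempts, and the delay are all assumptions, and moving items into sub-folders remains the actual fix.

```python
import os
import time


def listdir_with_retry(path, attempts=3, delay=1.0):
    """List a directory, retrying when the read raises an OSError."""
    for attempt in range(1, attempts + 1):
        try:
            return os.listdir(path)
        except OSError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)  # wait briefly before trying again


print(len(listdir_with_retry(".")))
```
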

In [0]:
import zipfile
import os
import shutil

# Same Drive folder that was passed as `path` to the download call above
base_dir = '/content/drive/My Drive/Deep_Learning/'

source_dir = base_dir + 'dogs-vs-cats'
dest_dir = base_dir + 'Cat_Dog_data/'

# Unzip the downloaded dataset into proper subdirectories
with zipfile.ZipFile(source_dir + '.zip') as archive:
    archive.extractall(source_dir)
# test1.zip is not extracted because its images are unlabeled
with zipfile.ZipFile(source_dir + '/train.zip') as archive:
    archive.extractall(source_dir + '/train')
The `make_dataset(source_dir, dest_dir, test_portion=0.2)` function does this work. The `test_portion` argument is the fraction of the whole dataset to be used as test data. We assume the training data contains roughly equal numbers of cat and dog images, and build the test set from the first `(test_portion / [number of classes] * [size of train data])` images of each class. This gives a test set whose size is a `test_portion` fraction of the whole training data, while keeping the classes balanced.
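
As a worked example of the formula, the sketch below computes the split sizes for an assumed training folder of 25,000 images. That figure is an assumption for illustration, so substitute the actual file count. The helper name `per_class_test_size` is hypothetical.

```python
def per_class_test_size(n_files, test_portion=0.2, n_classes=2):
    """Number of images taken from each class for the test set."""
    return int(n_files * test_portion / n_classes)


n_files = 25000  # assumed size of the training folder
per_class = per_class_test_size(n_files)

print(per_class)                # 2500 test images per class
print(per_class * 2)            # 5000 test images in total
print(n_files - per_class * 2)  # 20000 images left for training
```
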


### Structure data in the desired format

In [0]:
def make_dataset(source_dir, dest_dir, test_portion=0.2):
    """Split the unzipped training images into train/test class folders."""
    source_dir = source_dir + '/train/'

    # destination dirs dictionary
    dirs = {'train_cats': 'train/cat/',
            'train_dogs': 'train/dog/',
            'test_cats': 'test/cat/',
            'test_dogs': 'test/dog/'}
    # create dest folders
    for addr in dirs.values():
        os.makedirs(dest_dir + addr, exist_ok=True)

    # sort so that "the first images" of each class is the same on every run
    files = sorted(os.listdir(source_dir))
    # test images per class; cats and dogs are almost equal in number
    test_size = int(len(files) * test_portion / 2)

    # move each image, replacing the destination file if it exists
    cat_cnt = 0
    dog_cnt = 0
    for f in files:
        if f.startswith(("cat", "Cat")):
            if cat_cnt < test_size:  # `<` keeps exactly test_size test images
                cat_cnt += 1
                shutil.move(source_dir + f, dest_dir + dirs['test_cats'] + f)
            else:
                shutil.move(source_dir + f, dest_dir + dirs['train_cats'] + f)
        elif f.startswith(("dog", "Dog")):
            if dog_cnt < test_size:
                dog_cnt += 1
                shutil.move(source_dir + f, dest_dir + dirs['test_dogs'] + f)
            else:
                shutil.move(source_dir + f, dest_dir + dirs['train_dogs'] + f)

make_dataset(source_dir, dest_dir, 0.2)
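
Before deleting anything, it may be worth confirming that the four folders contain the expected number of files. The following sketch counts the files per folder; the helper name `count_files` is hypothetical, and the temporary directory only stands in for `dest_dir` so that the example is self-contained.

```python
import os
import tempfile


def count_files(dest_dir):
    """Return the number of files in each train/test class folder."""
    counts = {}
    for split in ("train", "test"):
        for cls in ("cat", "dog"):
            folder = os.path.join(dest_dir, split, cls)
            counts[f"{split}/{cls}"] = (
                len(os.listdir(folder)) if os.path.isdir(folder) else 0
            )
    return counts


# Demo on a throwaway folder holding a single test cat image
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "test", "cat"))
    open(os.path.join(demo_dir, "test", "cat", "cat.0.jpg"), "w").close()
    demo_counts = count_files(demo_dir)

print(demo_counts)
```

In the notebook, calling `count_files(dest_dir)` after `make_dataset` should show equal counts for `test/cat` and `test/dog`.
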

### Garbage Collection
After creating our dataset, it is time to delete the now-useless data that is taking up space :)

In [0]:
shutil.rmtree(source_dir)
os.remove(source_dir + '.zip')