# Dataset loader helper
This notebook builds the dataset used by the `cat_vs_dog` notebook.

Format required:

```
dogcat/
    train/
        dog/
            image01.jpeg
            image02.jpeg
            ...
        cat/
            image04.jpeg
            image05.jpeg
            ...
    valid/
        dog/
            image101.jpeg
            image102.jpeg
            ...
        cat/
            image104.jpeg
            image105.jpeg
            ...
    test/
        image_01.jpeg
        image_02.jpeg
        ...
```
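
The layout above can be checked before training. Below is a minimal sketch, not part of the original notebook: `check_layout` is a hypothetical helper, and the default split and class names are assumptions taken from the tree above. The Kaggle archive used later names its class folders `dogs` and `cats`, so pass `classes=("dogs", "cats")` for that copy.

```python
from pathlib import Path


def check_layout(root, splits=("train", "valid"), classes=("dog", "cat")):
    """Return the expected class folders (relative to root) that are missing."""
    root = Path(root)
    missing = []
    for split in splits:
        for cls in classes:
            folder = root / split / cls
            if not folder.is_dir():
                # Report the path relative to root so the message stays short
                missing.append(str(folder.relative_to(root)))
    return missing
```

An empty list means every `split/class` folder exists; anything else lists what still has to be created or renamed.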

## Installing the dataset from Kaggle

Reference used: [Link](https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/)

In [21]:
! pip install kaggle



In [22]:
! mkdir ~/.kaggle

mkdir: cannot create directory ‘/root/.kaggle’: File exists


In [23]:
! cp kaggle.json ~/.kaggle/

cp: cannot stat 'kaggle.json': No such file or directory


In [24]:
! chmod 600 ~/.kaggle/kaggle.json

chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
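
The three shell cells above print errors when they are re-run or when `kaggle.json` has not been uploaded yet. A possible alternative is sketched below in plain Python. `install_kaggle_token` is a hypothetical helper, not part of the Kaggle package; it assumes the token file is named `kaggle.json` and that `~/.kaggle` is where the Kaggle CLI looks for it.

```python
import shutil
from pathlib import Path


def install_kaggle_token(token_path="kaggle.json", kaggle_dir=None):
    """Copy a Kaggle API token into place and restrict its permissions.

    Returns the installed path, or None if token_path does not exist.
    """
    token = Path(token_path)
    if not token.is_file():
        return None  # nothing uploaded yet, so there is nothing to install
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # safe to re-run
    installed = kaggle_dir / "kaggle.json"
    shutil.copy(token, installed)
    installed.chmod(0o600)  # owner read/write only, same as `chmod 600`
    return installed
```

A `None` return value signals that the token still needs to be uploaded to the working directory before the download cell can succeed.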


In [25]:
! kaggle datasets download -d arpitjain007/dog-vs-cat-fastai

Dataset URL: https://www.kaggle.com/datasets/arpitjain007/dog-vs-cat-fastai
License(s): unknown
Downloading dog-vs-cat-fastai.zip to /content
100% 1.60G/1.60G [01:13<00:00, 21.6MB/s]
100% 1.60G/1.60G [01:13<00:00, 23.5MB/s]


In [26]:
! unzip dog-vs-cat-fastai.zip

Streaming output truncated to the last 5000 lines.
  inflating: dogscats/train/dogs/dog.7066.jpg  
  inflating: dogscats/train/dogs/dog.7067.jpg  
  inflating: dogscats/train/dogs/dog.7068.jpg  
  inflating: dogscats/train/dogs/dog.7069.jpg  
  inflating: dogscats/train/dogs/dog.707.jpg  
  inflating: dogscats/train/dogs/dog.7070.jpg  
  inflating: dogscats/train/dogs/dog.7071.jpg  
  inflating: dogscats/train/dogs/dog.7072.jpg  
  inflating: dogscats/train/dogs/dog.7073.jpg  
  inflating: dogscats/train/dogs/dog.7074.jpg  
  inflating: dogscats/train/dogs/dog.7075.jpg  
  inflating: dogscats/train/dogs/dog.7076.jpg  
  inflating: dogscats/train/dogs/dog.7077.jpg  
  inflating: dogscats/train/dogs/dog.7078.jpg  
  inflating: dogscats/train/dogs/dog.7079.jpg  
  inflating: dogscats/train/dogs/dog.708.jpg  
  inflating: dogscats/train/dogs/dog.7080.jpg  
  inflating: dogscats/train/dogs/dog.7081.jpg  
  inflating: dogscats/train/dogs/dog.7082.jpg  
  inflating: dogscats/tra
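
`unzip` prints one line per extracted file, which floods the cell output. One way to avoid that is the `-q` flag of `unzip`; another is Python's standard `zipfile` module, sketched below. `extract_quietly` is a hypothetical helper name, and the destination defaults to the current directory as an assumption.

```python
import zipfile
from pathlib import Path


def extract_quietly(zip_path, dest="."):
    """Extract every member of zip_path into dest and return the member count."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)  # creates dest and subfolders as needed
        return len(archive.namelist())
```

For this notebook the call would look like `extract_quietly("dog-vs-cat-fastai.zip", "/content")`, which reports a single number instead of thousands of `inflating:` lines.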

## Putting the data in the required format

In [27]:
from pathlib import Path
data_dir=Path('../data')
data_dir

PosixPath('../data')

In [28]:
! mkdir ~/.data

mkdir: cannot create directory ‘/root/.data’: File exists


In [29]:
target_dir_name = "../data/dogs_cats"
print(f"Creating directory: '{target_dir_name}'")

# Setup the directories
target_dir = Path(target_dir_name)

# Make the directories
target_dir.mkdir(parents=True, exist_ok=True)

Creating directory: '../data/dogs_cats'


In [30]:
import shutil

# Define the source and destination directories
source = Path('/content/dogscats')
target_dir = Path('/data/dogs_cats')

# Copy the entire directory tree of each split from source to destination
folder_list = ['train', 'valid']
for split in folder_list:
  source_dir = source / split
  destination_dir = target_dir / split

  try:
    shutil.copytree(source_dir, destination_dir)
    print(f"Successfully copied {source_dir} to {destination_dir}")
  except FileExistsError:
    print(f"Destination directory {destination_dir} already exists.")
  except Exception as e:
    print(f"An error occurred: {e}")


Successfully copied /content/dogscats/train to /data/dogs_cats/train
Successfully copied /content/dogscats/valid to /data/dogs_cats/valid


## Checking the directories

In [31]:
# Check lengths of directories
import os

def walk_through_dir(dir_path):
  """
  Walks through dir_path and prints a summary of its contents.

  Args:
    dir_path (str or Path): target directory

  Returns:
    None. Prints, for dir_path and every directory below it:
      number of subdirectories
      number of files (assumed to be images)
      path of the directory
  """
  for dirpath, dirnames, filenames in os.walk(dir_path):
    print(f"There are {len(dirnames)} directories and {len(filenames)} images in '{dirpath}'.")

walk_through_dir(target_dir)

There are 2 directories and 0 images in '/data/dogs_cats'.
There are 2 directories and 0 images in '/data/dogs_cats/train'.
There are 0 directories and 11500 images in '/data/dogs_cats/train/dogs'.
There are 0 directories and 11500 images in '/data/dogs_cats/train/cats'.
There are 2 directories and 0 images in '/data/dogs_cats/valid'.
There are 0 directories and 1000 images in '/data/dogs_cats/valid/dogs'.
There are 0 directories and 1000 images in '/data/dogs_cats/valid/cats'.


Looks like we've got 11,500 training images per class and 1,000 validation images per class. (The dataset also has about 12,000 test images, which aren't needed here because the input for prediction will come from the user later.)
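
Because `walk_through_dir` only prints, the per-class counts cannot be compared in code. A small sketch that returns them as a dictionary is shown below. `count_files_per_class` is a hypothetical helper; it assumes the `split/class` folder depth used in this notebook and counts every file, whether or not it is an image.

```python
from pathlib import Path


def count_files_per_class(root):
    """Map 'split/class' to the number of files directly inside that folder."""
    root = Path(root)
    counts = {}
    for class_dir in sorted(root.glob("*/*")):
        if class_dir.is_dir():
            key = f"{class_dir.parent.name}/{class_dir.name}"
            counts[key] = sum(1 for p in class_dir.iterdir() if p.is_file())
    return counts
```

With the result in a dictionary, a class imbalance can be caught with a simple comparison such as `counts["train/dogs"] == counts["train/cats"]`.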

## Zipping up images for easy transportation

In [32]:
zip_file_name = data_dir / "cat_v_dog"
shutil.make_archive(zip_file_name,
                    format="zip",
                    root_dir=target_dir)

'/data/cat_v_dog.zip'
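
With `root_dir` set and no `base_dir`, `shutil.make_archive` stores paths relative to `root_dir`, so the archive should hold `train/` and `valid/` at its top level rather than a `dogs_cats/` wrapper folder. The sketch below is one way to confirm that before deleting the source folders. `top_level_entries` is a hypothetical helper built on the standard `zipfile` module.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted first path components stored in a zip archive."""
    entries = set()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip empty and "." components so "./train/x" and "train/x" agree
            parts = [p for p in name.split("/") if p not in ("", ".")]
            if parts:
                entries.add(parts[0])
    return sorted(entries)
```

Whoever unpacks `cat_v_dog.zip` later needs to know this, because the extraction target becomes the dataset root directly.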

In [33]:
!ls -la ../data/

total 556780
drwxr-xr-x 4 root root      4096 Jun 19 00:23 .
drwxr-xr-x 1 root root      4096 Jun 19 00:15 ..
-rw-r--r-- 1 root root 570121167 Jun 19 00:23 cat_v_dog.zip
drwxr-xr-x 4 root root      4096 Jun 19 00:23 dogs_cats
drwx------ 5 root root      4096 Jun 19 00:19 drive


In [34]:
!ls ../data

cat_v_dog.zip  dogs_cats  drive


In [35]:
# Remove extra data

import os

try:
  shutil.rmtree("../data/dogs_cats")
  print("Removing directory ../data/dogs_cats")
except FileNotFoundError:
  print('Directory ../data/dogs_cats already removed')

try:
  shutil.rmtree("/content/dogscats")
  print("Removing directory /content/dogscats")
except FileNotFoundError:
  print('Directory /content/dogscats already removed')

try:
  os.remove("/content/dog-vs-cat-fastai.zip")
  print("Removing file /content/dog-vs-cat-fastai.zip")
except FileNotFoundError:
  print('file /content/dog-vs-cat-fastai.zip already removed')

try:
  os.remove("/content/kaggle.json")
  print("Removing file /content/kaggle.json")
except FileNotFoundError:
  print('file /content/kaggle.json already removed')

Removing directory ../data/dogs_cats
Removing directory /content/dogscats
Removing file /content/dog-vs-cat-fastai.zip
file /content/kaggle.json already removed


## Saving dataset file to local system or Google Drive (for backup)

In [36]:
# from google.colab import files
# files.download('/data/cat_v_dog.zip')

In [37]:
from google.colab import drive
drive.mount('/data/drive')

Drive already mounted at /data/drive; to attempt to forcibly remount, call drive.mount("/data/drive", force_remount=True).


In [38]:
shutil.copy('/data/cat_v_dog.zip','/data/drive/MyDrive')

'/data/drive/MyDrive/cat_v_dog.zip'
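
Since the source folders were deleted earlier, the Drive copy is the only backup, so it may be worth confirming the copy is complete. The sketch below compares file sizes after copying. `copy_and_verify` is a hypothetical helper, and a size check is only a cheap sanity test, not a checksum.

```python
import shutil
from pathlib import Path


def copy_and_verify(src, dest_dir):
    """Copy src into dest_dir and confirm the copy has the same byte size."""
    src = Path(src)
    copied = Path(shutil.copy(src, dest_dir))  # returns the new file's path
    if copied.stat().st_size != src.stat().st_size:
        raise OSError(f"Size mismatch after copying {src} to {copied}")
    return copied
```

In this notebook the call would be `copy_and_verify('/data/cat_v_dog.zip', '/data/drive/MyDrive')`.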