## Asirra: The Dogs vs. Cats Dataset

* [Published](https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/) in 2007 by Microsoft.
* This dataset appears in Chollet, *Deep Learning with Python*, 2nd edition, chapter 8 (book and associated notebook).

The dataset can be obtained from:
* [Dogs versus Cats](https://www.kaggle.com/c/dogs-vs-cats) - The Kaggle competition
* [The Dataset](https://www.kaggle.com/datasets/biaiscience/dogs-vs-cats) (with an Open Data license)

The competition page gives a good introduction. The following description of the dataset is quoted from it:

> ### The Asirra data set
> Web services are often protected with a challenge that's supposed to be easy for people to solve, but difficult for computers. Such a challenge is often called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). CAPTCHAs are used for many purposes, such as to reduce email and blog spam and prevent brute-force attacks on web site passwords.
>
> Asirra (Animal Species Image Recognition for Restricting Access) is a CAPTCHA that works by asking users to identify photographs of cats and dogs. This task is difficult for computers, but studies have shown that people can accomplish it quickly and accurately. Many even think it's fun! Here is an example of the Asirra interface:
>
> Asirra is unique because of its partnership with Petfinder.com, the world's largest site devoted to finding homes for homeless pets. They've provided Microsoft Research with over three million images of cats and dogs, manually classified by people at thousands of animal shelters across the United States. Kaggle is fortunate to offer a subset of this data for fun and research.

1. Download the dataset from Kaggle.
2. Extract `train.zip` into a folder of your choice. You should have **25,000** images, totaling about 600 MB. The folder can be inside the repository, e.g. `./data/kaggle_dogs_vs_cats/`. Anything saved under `./data/` is not tracked by Git, because that folder is listed in the `.gitignore` file. This is preferred, since Git and GitHub are not intended as data repositories.
3. Run the following code to create a small dataset of 1,000 training images, 500 validation images and 1,000 test images *per class*. This reduces the dataset from 25,000 images to 5,000.
4. Afterwards, you can manually delete the full dataset and the zip file.

In [2]:
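import os
import pathlib
import shutil

# Adjust these paths to wherever you extracted train.zip.
original_dir = pathlib.Path("../data/kaggle_dogs_vs_cats/train")
new_base_dir = pathlib.Path("../data/kaggle_dogs_vs_cats_small")

def make_subset(subset_name, start_index, end_index):
    """Copy images start_index..end_index-1 of each class into new_base_dir/subset_name."""
    for category in ("cat", "dog"):
        target_dir = new_base_dir / subset_name / category
        # exist_ok=True lets the cell be rerun without raising FileExistsError.
        os.makedirs(target_dir, exist_ok=True)
        fnames = [f"{category}.{i}.jpg" for i in range(start_index, end_index)]
        for fname in fnames:
            shutil.copyfile(src=original_dir / fname,
                            dst=target_dir / fname)

# Non-overlapping index ranges keep the three splits disjoint.
make_subset("train", start_index=0, end_index=1000)
make_subset("validation", start_index=1000, end_index=1500)
make_subset("test", start_index=1500, end_index=2500)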
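Once the subset has been created, a quick sanity check can confirm that each split holds the expected number of files (1,000 / 500 / 1,000 per class). The following is a minimal sketch and not part of the book's notebook: the helper name `count_images` is hypothetical, and it assumes the `<split>/<category>/<file>.jpg` layout produced by `make_subset` and the same base path as above.

```python
import pathlib
from collections import Counter

def count_images(base_dir):
    """Count .jpg files per (split, category) under base_dir.

    Assumes the layout base_dir/<split>/<category>/<file>.jpg.
    """
    counts = Counter()
    for path in pathlib.Path(base_dir).glob("*/*/*.jpg"):
        split, category = path.parts[-3], path.parts[-2]
        counts[(split, category)] += 1
    return dict(counts)

if __name__ == "__main__":
    # Assumed location; change it if you used a different new_base_dir.
    base = pathlib.Path("../data/kaggle_dogs_vs_cats_small")
    if base.is_dir():
        for (split, category), n in sorted(count_images(base).items()):
            print(f"{split:<10} {category:<4} {n}")
```

If any count differs from the split sizes above, the most likely cause is that the copy was interrupted or pointed at a different source folder.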

**To the Student**: Explore the data:
* Randomly scroll through the images. Any notable insights from the perspective of training and testing?
* What is the size in pixels of the images? 
* What color scheme (grayscale, RGB, other) are they coded in?
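One possible way to approach the size and color-scheme questions is to tally the values over a whole folder instead of inspecting a handful of files. The sketch below assumes Pillow is installed; the helper name `summarize_images` and the folder path are hypothetical, so adjust them to your own layout.

```python
import pathlib
from collections import Counter
from PIL import Image

def summarize_images(folder):
    """Tally color modes and (width, height) sizes of the .jpg files in folder."""
    modes, sizes = Counter(), Counter()
    for path in sorted(pathlib.Path(folder).glob("*.jpg")):
        # Image.open is lazy: it reads the header, not the full pixel data.
        with Image.open(path) as img:
            modes[img.mode] += 1
            sizes[img.size] += 1
    return modes, sizes

if __name__ == "__main__":
    # Assumed location of one class folder in the small dataset.
    folder = pathlib.Path("../data/kaggle_dogs_vs_cats_small/train/cat")
    if folder.is_dir():
        modes, sizes = summarize_images(folder)
        print("Color modes:", dict(modes))
        print("Most common sizes (width, height):", sizes.most_common(5))
```

A wide spread of sizes in the result would suggest that the images need to be resized to a common shape before being fed to a model.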

In [2]:
import os
import shutil
import random
from PIL import Image
import matplotlib.pyplot as plt

# Path to the original dataset (built with os.path.join so it also works outside Windows)
original_dataset_dir = os.path.join("data", "train", "train")

# Base directory for the smaller dataset
base_dir = os.path.join("data", "cats_and_dogs_small")
os.makedirs(base_dir, exist_ok=True)

# Create subdirectories for train, validation, and test splits
splits = ['train', 'validation', 'test']
classes = ['cats', 'dogs']
split_sizes = {'train': 1000, 'validation': 500, 'test': 1000}

for split in splits:
    for category in classes:
        dir_path = os.path.join(base_dir, split, category)
        os.makedirs(dir_path, exist_ok=True)

# Split the dataset
def copy_images():
    for category in classes:
        fnames = [f for f in os.listdir(original_dataset_dir) if f.startswith(category[:-1])]
        random.shuffle(fnames)

        start = 0
        for split in splits:
            n = split_sizes[split]
            split_fnames = fnames[start:start + n]
            for fname in split_fnames:
                src = os.path.join(original_dataset_dir, fname)
                dst = os.path.join(base_dir, split, category, fname)
                shutil.copyfile(src, dst)
            start += n

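# Run only once: each call reshuffles the file list, so rerunning copies a
# different random selection into the same folders and the splits may overlap.
copy_images()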

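# Data exploration
def explore_images(image_folder):
    # Sort the listing so the same five files are sampled on every run
    # (os.listdir returns entries in arbitrary order).
    sample_images = sorted(os.listdir(image_folder))[:5]
    for img_file in sample_images:
        img_path = os.path.join(image_folder, img_file)
        with Image.open(img_path) as img:
            print(f"Image: {img_file}")
            print(f" - Size: {img.size} (width x height)")
            print(f" - Mode: {img.mode} (Color scheme: RGB or Grayscale?)")
            # Opens the image in an external viewer window, not inline in the notebook.
            img.show()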

print("\n📊 Exploring a few cat images:")
explore_images(os.path.join(base_dir, 'train', 'cats'))

print("\n📊 Exploring a few dog images:")
explore_images(os.path.join(base_dir, 'train', 'dogs'))




📊 Exploring a few cat images:
Image: cat.0.jpg
 - Size: (500, 374) (width x height)
 - Mode: RGB (Color scheme: RGB or Grayscale?)
Image: cat.1.jpg
 - Size: (300, 280) (width x height)
 - Mode: RGB (Color scheme: RGB or Grayscale?)
Image: cat.1000.jpg
 - Size: (150, 149) (width x height)
 - Mode: RGB (Color scheme: RGB or Grayscale?)
Image: cat.10004.jpg
 - Size: (320, 239) (width x height)
 - Mode: RGB (Color scheme: RGB or Grayscale?)
Image: cat.10007.jpg
 - Size: (219, 300) (width x height)
 - Mode: RGB (Color scheme: RGB or Grayscale?)

📊 Exploring a few dog images:
Image: dog.0.jpg
 - Size: (499, 375) (width x height)
 - Mode: RGB (Color scheme: RGB or Grayscale?)
Image: dog.10000.jpg
 - Size: (263, 412) (width x height)
 - Mode: RGB (Color scheme: RGB or Grayscale?)
Image: dog.10012.jpg
 - Size: (500, 402) (width x height)
 - Mode: RGB (Color scheme: RGB or Grayscale?)
Image: dog.10014.jpg
 - Size: (399, 490) (width x height)
 - Mode: RGB (Color scheme: RGB or Grayscale?)
Image: