# Cats and Dogs statistics

**Checking the statistics of your data before training is not just a formality: it is a critical practice that informs how you prepare your dataset and interpret your model's performance.**

Understanding the distribution and balance of classes within your dataset is a fundamental step that can significantly impact the performance and evaluation of your classifier.

**Why Check Data Statistics?**

Class Distribution Awareness: Knowing how many samples belong to each class (e.g., Cats and Dogs) helps in understanding the dataset's overall structure. If one class significantly outnumbers another, it can lead to biased models that favor the majority class.

Ensuring Balanced Evaluation: Particularly in the test set, class imbalance can skew the evaluation metrics. For instance, if 90% of your test data are Dogs and only 10% are Cats, a model could achieve 90% accuracy by always predicting "Dog," without truly learning to distinguish between the two.

Adjusting Training Strategies: Awareness of class imbalance allows you to implement techniques like resampling, using class weights, or choosing appropriate evaluation metrics that account for imbalance (e.g., F1 score instead of accuracy).
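
The 90/10 scenario above can be reproduced in a few lines. This is a minimal sketch in plain Python, assuming a hypothetical test set of 90 dogs and 10 cats and a "model" that always answers "Dog"; the helper names `accuracy` and `f1_for_class` are illustrative and not part of this notebook.

```python
# Hypothetical test set: 90 dogs and 10 cats (illustrative numbers only).
y_true = ["Dog"] * 90 + ["Cat"] * 10
# A degenerate "model" that always predicts the majority class.
y_pred = ["Dog"] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score when `label` is treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


acc = accuracy(y_true, y_pred)
f1_cat = f1_for_class(y_true, y_pred, "Cat")
f1_dog = f1_for_class(y_true, y_pred, "Dog")
macro_f1 = (f1_cat + f1_dog) / 2  # unweighted mean over the two classes

print(f"Accuracy: {acc:.2f}")
print(f"F1 (Cat): {f1_cat:.2f}")
print(f"F1 (Dog): {f1_dog:.2f}")
print(f"Macro F1: {macro_f1:.2f}")
```

Accuracy looks strong here, while the F1 score for the Cat class is zero because no cat is ever predicted. Averaging the per-class F1 scores exposes the problem that accuracy hides.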

**Importance of Class Balance in the Test Set**

Fair Performance Assessment: A balanced test set ensures that the model's performance is evaluated equally across all classes. This balance is crucial for a fair assessment of how well the model generalizes to unseen data.

Avoiding Misleading Metrics: Imbalanced test sets can lead to misleading performance metrics. A model might appear to perform well overall but fail to correctly predict the minority class.

**Data Preparation Steps**

To address potential class imbalance, especially when one class has more samples than another, we adjust the splitting percentages when creating the train and test sets:

**Calculate Class Totals**: Determine the total number of images for each class to understand the extent of imbalance.

**Adjust Test Set Proportion**: Use a higher percentage of the minority class (Cats) and a lower percentage of the majority class (Dogs) to create a test set with an equal number of samples from each class.

**Proportional Distribution Among Breeds**: Within each class, distribute the test samples proportionally among the different breeds. This approach ensures that the test set represents the diversity within each class.

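**Create a Balanced Test Set**: By carefully adjusting the number of test samples from each class, we create a test set with an equal number of samples per class, which leads to more reliable and interpretable evaluation results. The train set receives the remaining images, so it keeps the original class imbalance.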
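
The proportional distribution among breeds can be sketched with a largest-remainder allocation: give each breed the integer part of its proportional share, then hand out the leftover images to the breeds with the largest fractional parts. The function name `allocate_test_counts` and the toy breed counts below are assumptions made for illustration only.

```python
def allocate_test_counts(breed_counts, test_total):
    """Distribute `test_total` test images over breeds in proportion to their size."""
    class_total = sum(breed_counts.values())
    # Exact (fractional) share of the test set for each breed.
    shares = {b: n / class_total * test_total for b, n in breed_counts.items()}
    # Start from the integer part of each share.
    allocation = {b: int(s) for b, s in shares.items()}
    leftover = test_total - sum(allocation.values())
    # Give the remaining images to the breeds with the largest fractional parts.
    by_fraction = sorted(shares, key=lambda b: shares[b] - allocation[b], reverse=True)
    for breed in by_fraction[:leftover]:
        allocation[breed] += 1
    return allocation


# Toy class with three hypothetical breeds and 4 test images to hand out.
toy_counts = {"breed_a": 7, "breed_b": 5, "breed_c": 3}
allocation = allocate_test_counts(toy_counts, 4)
print(allocation)
```

The allocated counts always add up to the requested total, and no breed ends up more than one image away from its exact proportional share.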

In [1]:
import os
import math
import shutil

This function will check all available files and display the statistics of a dataset.

In our dataset, cat filenames start with a capital letter and dog filenames start with a lowercase letter. Each filename is the breed name followed by a number. For example, `Abyssinian_99.jpg` is a cat of the Abyssinian breed, and `american_bulldog_5.jpg` is a dog of the American Bulldog breed.

So the function assigns every image file whose name starts with a capital letter to "Cats" and every image file whose name starts with a lowercase letter to "Dogs". It splits the filename on the underscore (`_`) to drop the trailing number and keeps the breed name, so that it can count all files of the same breed.
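
As a quick illustration of this naming convention, here is a minimal sketch that parses a single filename. The helper `parse_pet_filename` is hypothetical and is not used by the functions in this notebook; it applies the same rule to one file at a time.

```python
import os


def parse_pet_filename(filename):
    """Return (species, breed, number) or None if the name does not fit the convention."""
    stem = os.path.splitext(filename)[0]
    # Split on the LAST underscore: everything before it is the breed name.
    breed, sep, number = stem.rpartition("_")
    if not sep or not breed:
        return None
    if filename[0].isupper():
        species = "Cat"
    elif filename[0].islower():
        species = "Dog"
    else:
        # The name does not start with a letter.
        return None
    return species, breed, number


print(parse_pet_filename("Abyssinian_99.jpg"))       # ('Cat', 'Abyssinian', '99')
print(parse_pet_filename("american_bulldog_5.jpg"))  # ('Dog', 'american_bulldog', '5')
print(parse_pet_filename("notes.txt"))               # None
```

Splitting on the last underscore matters for multi-word breeds such as `american_bulldog`, where the breed name itself contains underscores.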

In [2]:
def count_images(folder):
    total_cats = 0
    total_dogs = 0
    cat_breeds = {}
    dog_breeds = {}
    allowed_extensions = ['.jpg', '.jpeg', '.png', '.gif']
    
    for filename in os.listdir(folder):
        # Full path
        file_path = os.path.join(folder, filename)
        # Check if it's a file with an allowed image extension
        if os.path.isfile(file_path) and os.path.splitext(filename)[1].lower() in allowed_extensions:
            # Get filename without extension
            filename_without_ext = os.path.splitext(filename)[0]
            # Split the filename to extract breed and number
            breed_parts = filename_without_ext.split('_')
            if len(breed_parts) < 2:
                # Skip files that don't follow the naming convention
                continue
            number_part = breed_parts[-1]
            breed_name = '_'.join(breed_parts[:-1])
            # Determine if it's a cat or dog based on the first character
            if filename[0].isupper():
                # Cat
                total_cats += 1
                cat_breeds[breed_name] = cat_breeds.get(breed_name, 0) + 1
            elif filename[0].islower():
                # Dog
                total_dogs += 1
                dog_breeds[breed_name] = dog_breeds.get(breed_name, 0) + 1
            else:
                # Not starting with a letter
                continue

    # Display statistics
    print(f"Total number of Cats: {total_cats}")
    print(f"Total number of Dogs: {total_dogs}")

    # Cat breed statistics
    print("\nCat Breed Statistics:")
    print("{:<40} {:<40}".format('Breed Name', 'Number of Images'))
    print("-" * 80)
    for breed, count in sorted(cat_breeds.items()):
        print("{:<40} {:<40}".format(breed, count))

    # Dog breed statistics
    print("\nDog Breed Statistics:")
    print("{:<40} {:<40}".format('Breed Name', 'Number of Images'))
    print("-" * 80)
    for breed, count in sorted(dog_breeds.items()):
        print("{:<40} {:<40}".format(breed, count))

This function splits the given images into train and test folders according to the given percentage.

Here the function takes 10% of the class with the most samples (Dogs), which gives 499 test images, and then takes the same number of images from the other class (Cats). Because we have more dogs (4990) than cats (2400), the test share of the dog images (10%) is smaller than the test share of the cat images (about 21%), which keeps the number of samples in each class of the test set the same.

To keep the breed distribution of the test set the same as that of the train set, we take the same percentage of each breed within a class.

The function copies files from the `./images` folder to `./train_images` and `./test_images` according to the split. Later we will use these folders for training and testing.
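
The split sizes for this dataset can be worked out by hand before running the function. The sketch below uses the class totals quoted above (2400 cats, 4990 dogs) and the default `given_percentage=0.1`; the variable names are illustrative.

```python
total_cats = 2400
total_dogs = 4990
given_percentage = 0.1

larger = max(total_cats, total_dogs)
smaller = min(total_cats, total_dogs)

# Test images per class: the given percentage of the larger class,
# capped by the size of the smaller class.
test_per_class = min(int(given_percentage * larger), smaller)

# Effective test fraction of each class.
dog_fraction = test_per_class / total_dogs
cat_fraction = test_per_class / total_cats

print(f"Test images per class: {test_per_class}")
print(f"Dogs in test: {dog_fraction:.1%}, Cats in test: {cat_fraction:.1%}")
print(f"Train cats: {total_cats - test_per_class}, train dogs: {total_dogs - test_per_class}")
```

With these totals each class contributes 499 test images, which leaves 1901 cats and 4491 dogs for training.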

In [3]:
def count_and_split_images(folder, train_folder_output, test_folder_output, given_percentage=0.1):
    total_cats = 0
    total_dogs = 0
    cat_breeds = {}
    dog_breeds = {}
    allowed_extensions = ['.jpg', '.jpeg', '.png', '.gif']

    # Dictionaries to hold filenames for each breed
    cat_files = {}
    dog_files = {}

    for filename in os.listdir(folder):
        # Full path
        file_path = os.path.join(folder, filename)
        # Check if it's a file with an allowed image extension
        if os.path.isfile(file_path) and os.path.splitext(filename)[1].lower() in allowed_extensions:
            # Get filename without extension
            filename_without_ext = os.path.splitext(filename)[0]
            # Split the filename to extract breed and number
            breed_parts = filename_without_ext.split('_')
            if len(breed_parts) < 2:
                # Skip files that don't follow the naming convention
                continue
            number_part = breed_parts[-1]
            breed_name = '_'.join(breed_parts[:-1])
            # Determine if it's a cat or dog based on the first character
            if filename[0].isupper():
                # Cat
                total_cats += 1
                cat_breeds[breed_name] = cat_breeds.get(breed_name, 0) + 1
                cat_files.setdefault(breed_name, []).append(filename)
            elif filename[0].islower():
                # Dog
                total_dogs += 1
                dog_breeds[breed_name] = dog_breeds.get(breed_name, 0) + 1
                dog_files.setdefault(breed_name, []).append(filename)
            else:
                # Not starting with a letter
                continue

    # Display statistics
    print(f"Total number of Cats: {total_cats}")
    print(f"Total number of Dogs: {total_dogs}")

    # Cat breed statistics
    print("\nCat Breed Statistics:")
    print("{:<40} {:<40}".format('Breed Name', 'Number of Images'))
    print("-" * 80)
    for breed, count in sorted(cat_breeds.items()):
        print("{:<40} {:<40}".format(breed, count))

    # Dog breed statistics
    print("\nDog Breed Statistics:")
    print("{:<40} {:<40}".format('Breed Name', 'Number of Images'))
    print("-" * 80)
    for breed, count in sorted(dog_breeds.items()):
        print("{:<40} {:<40}".format(breed, count))

    # Create train and test directories if they don't exist
    os.makedirs(train_folder_output, exist_ok=True)
    os.makedirs(test_folder_output, exist_ok=True)

    # Determine which class has the maximum number of samples
    if total_cats >= total_dogs:
        max_class = 'cats'
        max_class_total_images = total_cats
        min_class = 'dogs'
        min_class_total_images = total_dogs
        max_class_files = cat_files
        min_class_files = dog_files
    else:
        max_class = 'dogs'
        max_class_total_images = total_dogs
        min_class = 'cats'
        min_class_total_images = total_cats
        max_class_files = dog_files
        min_class_files = cat_files

    # Calculate initial number of test images for the class with the maximum samples
    initial_max_class_test_images = int(given_percentage * max_class_total_images)

    # Determine the maximum number of test images per class
    max_test_images_per_class = min(initial_max_class_test_images, min_class_total_images)

    # Adjust percentages accordingly
    adjusted_given_percentage = max_test_images_per_class / max_class_total_images
    adjusted_min_class_percentage = max_test_images_per_class / min_class_total_images

    # Function to split and copy files
    def split_and_copy_files(breed_files, total_images_in_class, test_images_per_class, src_folder):
        test_images_per_breed = {}
        total_assigned = 0
        fractional_parts = []

        # First pass: calculate test images per breed, integer part
        for breed, files in breed_files.items():
            num_files_in_breed = len(files)
            proportion = (num_files_in_breed / total_images_in_class) * test_images_per_class
            test_images = int(proportion)
            test_images_per_breed[breed] = test_images
            total_assigned += test_images
            fractional_part = proportion - test_images
            fractional_parts.append((fractional_part, breed))

        difference = test_images_per_class - total_assigned

        # Adjust test_images_per_breed to match test_images_per_class
        if difference > 0:
            # Need to assign additional test images
            # Sort breeds by fractional parts in descending order
            fractional_parts.sort(reverse=True)
            for i in range(difference):
                breed = fractional_parts[i][1]
                test_images_per_breed[breed] += 1
        elif difference < 0:
            # Need to reduce test images
            # Sort breeds by fractional parts in ascending order
            fractional_parts.sort()
            for i in range(-difference):
                breed = fractional_parts[i][1]
                if test_images_per_breed[breed] > 0:
                    test_images_per_breed[breed] -= 1
                else:
                    # Can't reduce further, proceed to next breed
                    continue

        # Now, copy files per breed
        for breed, files in breed_files.items():
            files.sort()
            num_test = test_images_per_breed[breed]
            test_files = files[:num_test]
            train_files = files[num_test:]

            # Copy test files
            for file in test_files:
                src = os.path.join(src_folder, file)
                dst = os.path.join(test_folder_output, file)
                shutil.copy(src, dst)

            # Copy train files
            for file in train_files:
                src = os.path.join(src_folder, file)
                dst = os.path.join(train_folder_output, file)
                shutil.copy(src, dst)

    # Split and copy files for both classes
    split_and_copy_files(max_class_files, max_class_total_images, max_test_images_per_class, folder)
    split_and_copy_files(min_class_files, min_class_total_images, max_test_images_per_class, folder)

    print("\nTrain and test datasets have been created.")
    print(f"Train images are in: {train_folder_output}")
    print(f"Test images are in: {test_folder_output}")

Display the statistics first. Here we can see that the number of samples for each breed is almost the same, so the breeds are balanced.

In [4]:
count_images('./images')

Total number of Cats: 2400
Total number of Dogs: 4990

Cat Breed Statistics:
Breed Name                               Number of Images                        
--------------------------------------------------------------------------------
Abyssinian                               200                                     
Bengal                                   200                                     
Birman                                   200                                     
Bombay                                   200                                     
British_Shorthair                        200                                     
Egyptian_Mau                             200                                     
Maine_Coon                               200                                     
Persian                                  200                                     
Ragdoll                                  200                                     
Russian_Blue          

Now let us split the images into training and testing sets.

In [5]:
count_and_split_images('./images', './train_images', './test_images', given_percentage=0.1)

Total number of Cats: 2400
Total number of Dogs: 4990

Cat Breed Statistics:
Breed Name                               Number of Images                        
--------------------------------------------------------------------------------
Abyssinian                               200                                     
Bengal                                   200                                     
Birman                                   200                                     
Bombay                                   200                                     
British_Shorthair                        200                                     
Egyptian_Mau                             200                                     
Maine_Coon                               200                                     
Persian                                  200                                     
Ragdoll                                  200                                     
Russian_Blue          

Now we can check the statistics of each set to make sure everything went as expected.

In [6]:
count_images('./train_images')

Total number of Cats: 1901
Total number of Dogs: 4491

Cat Breed Statistics:
Breed Name                               Number of Images                        
--------------------------------------------------------------------------------
Abyssinian                               159                                     
Bengal                                   159                                     
Birman                                   159                                     
Bombay                                   159                                     
British_Shorthair                        159                                     
Egyptian_Mau                             158                                     
Maine_Coon                               158                                     
Persian                                  158                                     
Ragdoll                                  158                                     
Russian_Blue          

In [7]:
count_images('./test_images')

Total number of Cats: 499
Total number of Dogs: 499

Cat Breed Statistics:
Breed Name                               Number of Images                        
--------------------------------------------------------------------------------
Abyssinian                               41                                      
Bengal                                   41                                      
Birman                                   41                                      
Bombay                                   41                                      
British_Shorthair                        41                                      
Egyptian_Mau                             42                                      
Maine_Coon                               42                                      
Persian                                  42                                      
Ragdoll                                  42                                      
Russian_Blue            
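
One extra sanity check, offered here as an optional sketch, is to confirm that no image ended up in both sets. The split function creates the output folders with `exist_ok=True` and never clears them, so running it a second time with a different percentage could leave the same file in both folders. The helper `find_overlap` and the sample filenames below are hypothetical.

```python
def find_overlap(train_names, test_names):
    """Return the sorted filenames that appear in both lists."""
    return sorted(set(train_names) & set(test_names))


# Hypothetical listings; in the notebook these would come from
# os.listdir('./train_images') and os.listdir('./test_images').
train_names = ["Abyssinian_1.jpg", "american_bulldog_5.jpg"]
test_names = ["Abyssinian_99.jpg", "american_bulldog_5.jpg"]

overlap = find_overlap(train_names, test_names)
print(f"Files present in both sets: {overlap}")
```

An empty list means the two sets are disjoint. Any filename that shows up should be removed from one of the folders before training.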

After copying the files into the training and testing datasets, we can move on to training.